Linux 5.15-rc2
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmFH1aYeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG8OkH/R6C6xduJ7I/LkUK S3bATqOwAnpR8Z/1cbNIj7NG7wFEiRbOS0xgnxteUgo34xGVWUoRgdM2wvi4mnst fvWsM/8765g4ok5WtmxZ+ZeClVFo2eGWyZqfsXMlisPNwswceSPOXDPMcEKlROZA CGFqCh0hmma3+waAKf85ZfFR8i/gndbI574MKq00OPNr02Z8GSqQMJSaCzNbGspc AMpdXezEG6CnwRUnlmYZoXOBP9+/ItAMZ6RTPulitJKT3PmuHcs9Pnu6qqbW2X9B fRW3tQNE5Q4oLHPb7H9m13jMNwuIP4DI3b3qnUSjl0g4OKBX0q/Uw1UfntC5Ejxn 9d7jvZ8= =6RN4 -----END PGP SIGNATURE----- gpgsig -----BEGIN PGP SIGNATURE----- iQEzBAABCgAdFiEEreZoqmdXGLWf4p/qJNaLcl1Uh9AFAmFJ6ukACgkQJNaLcl1U h9AwcAf/W8ESV6W8ZzsgIaI+waCk5xB55g1RdF6GzXlLKq30Va/i/12Lgc0xmUvL FEfQ9h63oBLB2KR3lUmztofCm8bN9bBTCYGJUnJh/z/vfXN8Av1TJdfGUBIbjcCE UNS4Fr6oE2PGDzleJNnE23GULpOYWg4ITzTC5TVkTlXkuUqmx2EZjQGkpBWz/Y3z 6vpk1wm9XJZxbU4hh3eA3pQkDbSdVHSAZnlEZudxJyuOj0JSJHLgqwDIkO7spy/v V/WQIsfeqEkFPfCkPciodSHpl1XMxU5qdVC2wo11skRhHOHLwlVrOn2TcpvY7tq7 qEstYJfaLGVoGpzu4Cv9Eu+BzEOVNA== =Czel -----END PGP SIGNATURE----- Merge tag 'v5.15-rc2' into spi-5.15 Linux 5.15-rc2
This commit is contained in:
commit
ffb1e76f4f
.mailmap
Documentation
ABI
stable
testing
configfs-usb-gadget-uac1configfs-usb-gadget-uac2debugfs-driver-habanalabsdell-smbios-wmiima_policysysfs-blocksysfs-block-devicesysfs-bus-event_source-devices-uncoresysfs-bus-iio-chemical-sgp40sysfs-bus-pcisysfs-bus-platformsysfs-bus-thunderboltsysfs-class-firmware-attributessysfs-devices-system-cpusysfs-driver-ge-achcsysfs-driver-intc_sarsysfs-driver-ufssysfs-fs-f2fssysfs-kernel-dmabuf-bufferssysfs-kernel-iommu_groupssysfs-kernel-mm-numasysfs-platform-dell-smbiossysfs-platform-dptfsysfs-platform-intel-wmi-thunderboltsysfs-platform_profilesysfs-powersysfs-ptp
PCI
RCU
admin-guide
README.rst
acpi
binderfs.rstbootconfig.rstcgroup-v2.rstcputopology.rstdevice-mapper
devices.txthw-vuln
kernel-parameters.txtlaptops
mm
sysctl
sysrq.rstarm
arm64
atomic_t.txtblock
bpf
conf.pycore-api
cpu-freq
dev-tools
devicetree/bindings
2
.mailmap
2
.mailmap
@ -229,6 +229,7 @@ Matthew Wilcox <willy@infradead.org> <mawilcox@microsoft.com>
|
||||
Matthew Wilcox <willy@infradead.org> <willy@debian.org>
|
||||
Matthew Wilcox <willy@infradead.org> <willy@linux.intel.com>
|
||||
Matthew Wilcox <willy@infradead.org> <willy@parisc-linux.org>
|
||||
Matthias Fuchs <socketcan@esd.eu> <matthias.fuchs@esd.eu>
|
||||
Matthieu CASTET <castet.matthieu@free.fr>
|
||||
Matt Ranostay <matt.ranostay@konsulko.com> <matt@ranostay.consulting>
|
||||
Matt Ranostay <mranostay@gmail.com> Matthew Ranostay <mranostay@embeddedalley.com>
|
||||
@ -341,6 +342,7 @@ Sumit Semwal <sumit.semwal@ti.com>
|
||||
Takashi YOSHII <takashi.yoshii.zj@renesas.com>
|
||||
Tejun Heo <htejun@gmail.com>
|
||||
Thomas Graf <tgraf@suug.ch>
|
||||
Thomas Körper <socketcan@esd.eu> <thomas.koerper@esd.eu>
|
||||
Thomas Pedersen <twp@codeaurora.org>
|
||||
Tiezhu Yang <yangtiezhu@loongson.cn> <kernelpatch@126.com>
|
||||
Todor Tomov <todor.too@gmail.com> <todor.tomov@linaro.org>
|
||||
|
@ -128,6 +128,8 @@ Date: Aug 28, 2020
|
||||
KernelVersion: 5.10.0
|
||||
Contact: dmaengine@vger.kernel.org
|
||||
Description: The last executed device administrative command's status/error.
|
||||
Also last configuration error overloaded.
|
||||
Writing to it will clear the status.
|
||||
|
||||
What: /sys/bus/dsa/devices/wq<m>.<n>/block_on_fault
|
||||
Date: Oct 27, 2020
|
||||
@ -211,6 +213,13 @@ Contact: dmaengine@vger.kernel.org
|
||||
Description: Indicate whether ATS disable is turned on for the workqueue.
|
||||
0 indicates ATS is on, and 1 indicates ATS is off for the workqueue.
|
||||
|
||||
What: /sys/bus/dsa/devices/wq<m>.<n>/occupancy
|
||||
Date May 25, 2021
|
||||
KernelVersion: 5.14.0
|
||||
Contact: dmaengine@vger.kernel.org
|
||||
Description: Show the current number of entries in this WQ if WQ Occupancy
|
||||
Support bit WQ capabilities is 1.
|
||||
|
||||
What: /sys/bus/dsa/devices/engine<m>.<n>/group_id
|
||||
Date: Oct 25, 2019
|
||||
KernelVersion: 5.6.0
|
||||
|
@ -8,9 +8,19 @@ Description:
|
||||
c_chmask capture channel mask
|
||||
c_srate capture sampling rate
|
||||
c_ssize capture sample size (bytes)
|
||||
c_mute_present capture mute control enable
|
||||
c_volume_present capture volume control enable
|
||||
c_volume_min capture volume control min value (in 1/256 dB)
|
||||
c_volume_max capture volume control max value (in 1/256 dB)
|
||||
c_volume_res capture volume control resolution (in 1/256 dB)
|
||||
p_chmask playback channel mask
|
||||
p_srate playback sampling rate
|
||||
p_ssize playback sample size (bytes)
|
||||
p_mute_present playback mute control enable
|
||||
p_volume_present playback volume control enable
|
||||
p_volume_min playback volume control min value (in 1/256 dB)
|
||||
p_volume_max playback volume control max value (in 1/256 dB)
|
||||
p_volume_res playback volume control resolution (in 1/256 dB)
|
||||
req_number the number of pre-allocated request
|
||||
for both capture and playback
|
||||
========== ===================================
|
||||
|
@ -9,8 +9,18 @@ Description:
|
||||
c_srate capture sampling rate
|
||||
c_ssize capture sample size (bytes)
|
||||
c_sync capture synchronization type (async/adaptive)
|
||||
c_mute_present capture mute control enable
|
||||
c_volume_present capture volume control enable
|
||||
c_volume_min capture volume control min value (in 1/256 dB)
|
||||
c_volume_max capture volume control max value (in 1/256 dB)
|
||||
c_volume_res capture volume control resolution (in 1/256 dB)
|
||||
fb_max maximum extra bandwidth in async mode
|
||||
p_chmask playback channel mask
|
||||
p_srate playback sampling rate
|
||||
p_ssize playback sample size (bytes)
|
||||
p_mute_present playback mute control enable
|
||||
p_volume_present playback volume control enable
|
||||
p_volume_min playback volume control min value (in 1/256 dB)
|
||||
p_volume_max playback volume control max value (in 1/256 dB)
|
||||
p_volume_res playback volume control resolution (in 1/256 dB)
|
||||
========= ============================
|
||||
|
@ -215,6 +215,17 @@ Description: Sets the skip reset on timeout option for the device. Value of
|
||||
"0" means device will be reset in case some CS has timed out,
|
||||
otherwise it will not be reset.
|
||||
|
||||
What: /sys/kernel/debug/habanalabs/hl<n>/state_dump
|
||||
Date: Oct 2021
|
||||
KernelVersion: 5.15
|
||||
Contact: ynudelman@habana.ai
|
||||
Description: Gets the state dump occurring on a CS timeout or failure.
|
||||
State dump is used for debug and is created each time in case of
|
||||
a problem in a CS execution, before reset.
|
||||
Reading from the node returns the newest state dump available.
|
||||
Writing an integer X discards X state dumps, so that the
|
||||
next read would return X+1-st newest state dump.
|
||||
|
||||
What: /sys/kernel/debug/habanalabs/hl<n>/stop_on_err
|
||||
Date: Mar 2020
|
||||
KernelVersion: 5.6
|
||||
@ -230,6 +241,14 @@ Description: Displays a list with information about the currently user
|
||||
pointers (user virtual addresses) that are pinned and mapped
|
||||
to DMA addresses
|
||||
|
||||
What: /sys/kernel/debug/habanalabs/hl<n>/userptr_lookup
|
||||
Date: Aug 2021
|
||||
KernelVersion: 5.15
|
||||
Contact: ogabbay@kernel.org
|
||||
Description: Allows to search for specific user pointers (user virtual
|
||||
addresses) that are pinned and mapped to DMA addresses, and see
|
||||
their resolution to the specific dma address.
|
||||
|
||||
What: /sys/kernel/debug/habanalabs/hl<n>/vm
|
||||
Date: Jan 2019
|
||||
KernelVersion: 5.1
|
||||
|
@ -1,7 +1,7 @@
|
||||
What: /dev/wmi/dell-smbios
|
||||
Date: November 2017
|
||||
KernelVersion: 4.15
|
||||
Contact: "Mario Limonciello" <mario.limonciello@dell.com>
|
||||
Contact: Dell.Client.Kernel@dell.com
|
||||
Description:
|
||||
Perform SMBIOS calls on supported Dell machines.
|
||||
through the Dell ACPI-WMI interface.
|
||||
|
@ -27,12 +27,13 @@ Description:
|
||||
lsm: [[subj_user=] [subj_role=] [subj_type=]
|
||||
[obj_user=] [obj_role=] [obj_type=]]
|
||||
option: [[appraise_type=]] [template=] [permit_directio]
|
||||
[appraise_flag=] [keyrings=]
|
||||
[appraise_flag=] [appraise_algos=] [keyrings=]
|
||||
base:
|
||||
func:= [BPRM_CHECK][MMAP_CHECK][CREDS_CHECK][FILE_CHECK][MODULE_CHECK]
|
||||
[FIRMWARE_CHECK]
|
||||
[FIRMWARE_CHECK]
|
||||
[KEXEC_KERNEL_CHECK] [KEXEC_INITRAMFS_CHECK]
|
||||
[KEXEC_CMDLINE] [KEY_CHECK] [CRITICAL_DATA]
|
||||
[SETXATTR_CHECK]
|
||||
mask:= [[^]MAY_READ] [[^]MAY_WRITE] [[^]MAY_APPEND]
|
||||
[[^]MAY_EXEC]
|
||||
fsmagic:= hex value
|
||||
@ -55,6 +56,10 @@ Description:
|
||||
label:= [selinux]|[kernel_info]|[data_label]
|
||||
data_label:= a unique string used for grouping and limiting critical data.
|
||||
For example, "selinux" to measure critical data for SELinux.
|
||||
appraise_algos:= comma-separated list of hash algorithms
|
||||
For example, "sha256,sha512" to only accept to appraise
|
||||
files where the security.ima xattr was hashed with one
|
||||
of these two algorithms.
|
||||
|
||||
default policy:
|
||||
# PROC_SUPER_MAGIC
|
||||
@ -134,3 +139,9 @@ Description:
|
||||
keys added to .builtin_trusted_keys or .ima keyring:
|
||||
|
||||
measure func=KEY_CHECK keyrings=.builtin_trusted_keys|.ima
|
||||
|
||||
Example of the special SETXATTR_CHECK appraise rule, that
|
||||
restricts the hash algorithms allowed when writing to the
|
||||
security.ima xattr of a file:
|
||||
|
||||
appraise func=SETXATTR_CHECK appraise_algos=sha256,sha384,sha512
|
||||
|
@ -28,6 +28,18 @@ Description:
|
||||
For more details refer Documentation/admin-guide/iostats.rst
|
||||
|
||||
|
||||
What: /sys/block/<disk>/diskseq
|
||||
Date: February 2021
|
||||
Contact: Matteo Croce <mcroce@microsoft.com>
|
||||
Description:
|
||||
The /sys/block/<disk>/diskseq files reports the disk
|
||||
sequence number, which is a monotonically increasing
|
||||
number assigned to every drive.
|
||||
Some devices, like the loop device, refresh such number
|
||||
every time the backing file is changed.
|
||||
The value type is 64 bit unsigned.
|
||||
|
||||
|
||||
What: /sys/block/<disk>/<part>/stat
|
||||
Date: February 2008
|
||||
Contact: Jerome Marchand <jmarchan@redhat.com>
|
||||
|
@ -55,6 +55,43 @@ Date: Oct, 2016
|
||||
KernelVersion: v4.10
|
||||
Contact: linux-ide@vger.kernel.org
|
||||
Description:
|
||||
(RW) Write to the file to turn on or off the SATA ncq (native
|
||||
command queueing) support. By default this feature is turned
|
||||
off.
|
||||
(RW) Write to the file to turn on or off the SATA NCQ (native
|
||||
command queueing) priority support. By default this feature is
|
||||
turned off. If the device does not support the SATA NCQ
|
||||
priority feature, writing "1" to this file results in an error
|
||||
(see ncq_prio_supported).
|
||||
|
||||
|
||||
What: /sys/block/*/device/sas_ncq_prio_enable
|
||||
Date: Oct, 2016
|
||||
KernelVersion: v4.10
|
||||
Contact: linux-ide@vger.kernel.org
|
||||
Description:
|
||||
(RW) This is the equivalent of the ncq_prio_enable attribute
|
||||
file for SATA devices connected to a SAS host-bus-adapter
|
||||
(HBA) implementing support for the SATA NCQ priority feature.
|
||||
This file does not exist if the HBA driver does not implement
|
||||
support for the SATA NCQ priority feature, regardless of the
|
||||
device support for this feature (see sas_ncq_prio_supported).
|
||||
|
||||
|
||||
What: /sys/block/*/device/ncq_prio_supported
|
||||
Date: Aug, 2021
|
||||
KernelVersion: v5.15
|
||||
Contact: linux-ide@vger.kernel.org
|
||||
Description:
|
||||
(RO) Indicates if the device supports the SATA NCQ (native
|
||||
command queueing) priority feature.
|
||||
|
||||
|
||||
What: /sys/block/*/device/sas_ncq_prio_supported
|
||||
Date: Aug, 2021
|
||||
KernelVersion: v5.15
|
||||
Contact: linux-ide@vger.kernel.org
|
||||
Description:
|
||||
(RO) This is the equivalent of the ncq_prio_supported attribute
|
||||
file for SATA devices connected to a SAS host-bus-adapter
|
||||
(HBA) implementing support for the SATA NCQ priority feature.
|
||||
This file does not exist if the HBA driver does not implement
|
||||
support for the SATA NCQ priority feature, regardless of the
|
||||
device support for this feature.
|
||||
|
@ -0,0 +1,13 @@
|
||||
What: /sys/bus/event_source/devices/uncore_*/alias
|
||||
Date: June 2021
|
||||
KernelVersion: 5.15
|
||||
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
|
||||
Description: Read-only. An attribute to describe the alias name of
|
||||
the uncore PMU if an alias exists on some platforms.
|
||||
The 'perf(1)' tool should treat both names the same.
|
||||
They both can be used to access the uncore PMU.
|
||||
|
||||
Example:
|
||||
|
||||
$ cat /sys/devices/uncore_cha_2/alias
|
||||
uncore_type_0_2
|
31
Documentation/ABI/testing/sysfs-bus-iio-chemical-sgp40
Normal file
31
Documentation/ABI/testing/sysfs-bus-iio-chemical-sgp40
Normal file
@ -0,0 +1,31 @@
|
||||
What: /sys/bus/iio/devices/iio:deviceX/out_temp_raw
|
||||
Date: August 2021
|
||||
KernelVersion: 5.15
|
||||
Contact: Andreas Klinger <ak@it-klinger.de>
|
||||
Description:
|
||||
Set the temperature. This value is sent to the sensor for
|
||||
temperature compensation.
|
||||
Default value: 25000 (25 °C)
|
||||
|
||||
What: /sys/bus/iio/devices/iio:deviceX/out_humidityrelative_raw
|
||||
Date: August 2021
|
||||
KernelVersion: 5.15
|
||||
Contact: Andreas Klinger <ak@it-klinger.de>
|
||||
Description:
|
||||
Set the relative humidity. This value is sent to the sensor for
|
||||
humidity compensation.
|
||||
Default value: 50000 (50 % relative humidity)
|
||||
|
||||
What: /sys/bus/iio/devices/iio:deviceX/in_resistance_calibbias
|
||||
Date: August 2021
|
||||
KernelVersion: 5.15
|
||||
Contact: Andreas Klinger <ak@it-klinger.de>
|
||||
Description:
|
||||
Set the bias value for the resistance which is used for
|
||||
calculation of in_concentration_input as follows:
|
||||
|
||||
x = (in_resistance_raw - in_resistance_calibbias) * 0.65
|
||||
|
||||
in_concentration_input = 500 / (1 + e^x)
|
||||
|
||||
Default value: 30000
|
@ -121,6 +121,23 @@ Description:
|
||||
child buses, and re-discover devices removed earlier
|
||||
from this part of the device tree.
|
||||
|
||||
What: /sys/bus/pci/devices/.../reset_method
|
||||
Date: August 2021
|
||||
Contact: Amey Narkhede <ameynarkhede03@gmail.com>
|
||||
Description:
|
||||
Some devices allow an individual function to be reset
|
||||
without affecting other functions in the same slot.
|
||||
|
||||
For devices that have this support, a file named
|
||||
reset_method is present in sysfs. Reading this file
|
||||
gives names of the supported and enabled reset methods and
|
||||
their ordering. Writing a space-separated list of names of
|
||||
reset methods sets the reset methods and ordering to be
|
||||
used when resetting the device. Writing an empty string
|
||||
disables the ability to reset the device. Writing
|
||||
"default" enables all supported reset methods in the
|
||||
default ordering.
|
||||
|
||||
What: /sys/bus/pci/devices/.../reset
|
||||
Date: July 2009
|
||||
Contact: Michael S. Tsirkin <mst@redhat.com>
|
||||
|
@ -28,3 +28,17 @@ Description:
|
||||
value comes from an ACPI _PXM method or a similar firmware
|
||||
source. Initial users for this file would be devices like
|
||||
arm smmu which are populated by arm64 acpi_iort.
|
||||
|
||||
What: /sys/bus/platform/devices/.../msi_irqs/
|
||||
Date: August 2021
|
||||
Contact: Barry Song <song.bao.hua@hisilicon.com>
|
||||
Description:
|
||||
The /sys/devices/.../msi_irqs directory contains a variable set
|
||||
of files, with each file being named after a corresponding msi
|
||||
irq vector allocated to that device.
|
||||
|
||||
What: /sys/bus/platform/devices/.../msi_irqs/<N>
|
||||
Date: August 2021
|
||||
Contact: Barry Song <song.bao.hua@hisilicon.com>
|
||||
Description:
|
||||
This attribute will show "msi" if <N> is a valid msi irq
|
||||
|
@ -232,7 +232,7 @@ Description: When new NVM image is written to the non-active NVM
|
||||
What: /sys/bus/thunderbolt/devices/.../nvm_authenticate_on_disconnect
|
||||
Date: Oct 2020
|
||||
KernelVersion: v5.9
|
||||
Contact: Mario Limonciello <mario.limonciello@dell.com>
|
||||
Contact: Mario Limonciello <mario.limonciello@outlook.com>
|
||||
Description: For supported devices, automatically authenticate the new Thunderbolt
|
||||
image when the device is disconnected from the host system.
|
||||
|
||||
|
@ -2,8 +2,8 @@ What: /sys/class/firmware-attributes/*/attributes/*/
|
||||
Date: February 2021
|
||||
KernelVersion: 5.11
|
||||
Contact: Divya Bharathi <Divya.Bharathi@Dell.com>,
|
||||
Mario Limonciello <mario.limonciello@dell.com>,
|
||||
Prasanth KSR <prasanth.ksr@dell.com>
|
||||
Dell.Client.Kernel@dell.com
|
||||
Description:
|
||||
A sysfs interface for systems management software to enable
|
||||
configuration capability on supported systems. This directory
|
||||
@ -130,8 +130,8 @@ What: /sys/class/firmware-attributes/*/authentication/
|
||||
Date: February 2021
|
||||
KernelVersion: 5.11
|
||||
Contact: Divya Bharathi <Divya.Bharathi@Dell.com>,
|
||||
Mario Limonciello <mario.limonciello@dell.com>,
|
||||
Prasanth KSR <prasanth.ksr@dell.com>
|
||||
Dell.Client.Kernel@dell.com
|
||||
Description:
|
||||
Devices support various authentication mechanisms which can be exposed
|
||||
as a separate configuration object.
|
||||
@ -220,8 +220,8 @@ What: /sys/class/firmware-attributes/*/attributes/pending_reboot
|
||||
Date: February 2021
|
||||
KernelVersion: 5.11
|
||||
Contact: Divya Bharathi <Divya.Bharathi@Dell.com>,
|
||||
Mario Limonciello <mario.limonciello@dell.com>,
|
||||
Prasanth KSR <prasanth.ksr@dell.com>
|
||||
Dell.Client.Kernel@dell.com
|
||||
Description:
|
||||
A read-only attribute reads 1 if a reboot is necessary to apply
|
||||
pending BIOS attribute changes. Also, an uevent_KOBJ_CHANGE is
|
||||
@ -249,8 +249,8 @@ What: /sys/class/firmware-attributes/*/attributes/reset_bios
|
||||
Date: February 2021
|
||||
KernelVersion: 5.11
|
||||
Contact: Divya Bharathi <Divya.Bharathi@Dell.com>,
|
||||
Mario Limonciello <mario.limonciello@dell.com>,
|
||||
Prasanth KSR <prasanth.ksr@dell.com>
|
||||
Dell.Client.Kernel@dell.com
|
||||
Description:
|
||||
This attribute can be used to reset the BIOS Configuration.
|
||||
Specifically, it tells which type of reset BIOS configuration is being
|
||||
@ -272,3 +272,14 @@ Description:
|
||||
|
||||
Note that any changes to this attribute requires a reboot
|
||||
for changes to take effect.
|
||||
|
||||
What: /sys/class/firmware-attributes/*/attributes/debug_cmd
|
||||
Date: July 2021
|
||||
KernelVersion: 5.14
|
||||
Contact: Mark Pearson <markpearson@lenovo.com>
|
||||
Description:
|
||||
This write only attribute can be used to send debug commands to the BIOS.
|
||||
This should only be used when recommended by the BIOS vendor. Vendors may
|
||||
use it to enable extra debug attributes or BIOS features for testing purposes.
|
||||
|
||||
Note that any changes to this attribute requires a reboot for changes to take effect.
|
||||
|
@ -494,6 +494,15 @@ Description: AArch64 CPU registers
|
||||
'identification' directory exposes the CPU ID registers for
|
||||
identifying model and revision of the CPU.
|
||||
|
||||
What: /sys/devices/system/cpu/aarch32_el0
|
||||
Date: May 2021
|
||||
Contact: Linux ARM Kernel Mailing list <linux-arm-kernel@lists.infradead.org>
|
||||
Description: Identifies the subset of CPUs in the system that can execute
|
||||
AArch32 (32-bit ARM) applications. If present, the same format as
|
||||
/sys/devices/system/cpu/{offline,online,possible,present} is used.
|
||||
If absent, then all or none of the CPUs can execute AArch32
|
||||
applications and execve() will behave accordingly.
|
||||
|
||||
What: /sys/devices/system/cpu/cpu#/cpu_capacity
|
||||
Date: December 2016
|
||||
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
|
||||
@ -640,3 +649,20 @@ Description: SPURR ticks for cpuX when it was idle.
|
||||
|
||||
This sysfs interface exposes the number of SPURR ticks
|
||||
for cpuX when it was idle.
|
||||
|
||||
What: /sys/devices/system/cpu/cpuX/mte_tcf_preferred
|
||||
Date: July 2021
|
||||
Contact: Linux ARM Kernel Mailing list <linux-arm-kernel@lists.infradead.org>
|
||||
Description: Preferred MTE tag checking mode
|
||||
|
||||
When a user program specifies more than one MTE tag checking
|
||||
mode, this sysfs node is used to specify which mode should
|
||||
be preferred when scheduling a task on that CPU. Possible
|
||||
values:
|
||||
|
||||
================ ==============================================
|
||||
"sync" Prefer synchronous mode
|
||||
"async" Prefer asynchronous mode
|
||||
================ ==============================================
|
||||
|
||||
See also: Documentation/arm64/memory-tagging-extension.rst
|
||||
|
15
Documentation/ABI/testing/sysfs-driver-ge-achc
Normal file
15
Documentation/ABI/testing/sysfs-driver-ge-achc
Normal file
@ -0,0 +1,15 @@
|
||||
What: /sys/bus/spi/<dev>/update_firmware
|
||||
Date: Jul 2021
|
||||
Contact: sebastian.reichel@collabora.com
|
||||
Description: Write 1 to this file to update the ACHC microcontroller
|
||||
firmware via the EzPort interface. For this the kernel
|
||||
will load "achc.bin" via the firmware API (so usually
|
||||
from /lib/firmware). The write will block until the FW
|
||||
has either been flashed successfully or an error occured.
|
||||
|
||||
What: /sys/bus/spi/<dev>/reset
|
||||
Date: Jul 2021
|
||||
Contact: sebastian.reichel@collabora.com
|
||||
Description: This file represents the microcontroller's reset line.
|
||||
1 means the reset line is asserted, 0 means it's not
|
||||
asserted. The file is read and writable.
|
54
Documentation/ABI/testing/sysfs-driver-intc_sar
Normal file
54
Documentation/ABI/testing/sysfs-driver-intc_sar
Normal file
@ -0,0 +1,54 @@
|
||||
What: /sys/bus/platform/devices/INTC1092:00/intc_reg
|
||||
Date: August 2021
|
||||
KernelVersion: 5.15
|
||||
Contact: Shravan S <s.shravan@intel.com>,
|
||||
An Sudhakar <sudhakar.an@intel.com>
|
||||
Description:
|
||||
Specific Absorption Rate (SAR) regulatory mode is typically
|
||||
derived based on information like mcc (Mobile Country Code) and
|
||||
mnc (Mobile Network Code) that is available for the currently
|
||||
attached LTE network. A userspace application is required to set
|
||||
the current SAR regulatory mode on the Dynamic SAR driver using
|
||||
this sysfs node. Such an application can also read back using
|
||||
this sysfs node, the currently configured regulatory mode value
|
||||
from the Dynamic SAR driver.
|
||||
|
||||
Acceptable regulatory modes are:
|
||||
== ====
|
||||
0 FCC
|
||||
1 CE
|
||||
2 ISED
|
||||
== ====
|
||||
|
||||
- The regulatory mode value has one of the above values.
|
||||
- The default regulatory mode used in the driver is 0.
|
||||
|
||||
What: /sys/bus/platform/devices/INTC1092:00/intc_data
|
||||
Date: August 2021
|
||||
KernelVersion: 5.15
|
||||
Contact: Shravan S <s.shravan@intel.com>,
|
||||
An Sudhakar <sudhakar.an@intel.com>
|
||||
Description:
|
||||
This sysfs entry is used to retrieve Dynamic SAR information
|
||||
emitted/maintained by a BIOS that supports Dynamic SAR.
|
||||
|
||||
The retrieved information is in the order given below:
|
||||
- device_mode
|
||||
- bandtable_index
|
||||
- antennatable_index
|
||||
- sartable_index
|
||||
|
||||
The above information is sent as integer values separated
|
||||
by a single space. This information can then be pushed to a
|
||||
WWAN modem that uses this to control the transmit signal
|
||||
level using the Band/Antenna/SAR table index information.
|
||||
These parameters are derived/decided by aggregating
|
||||
device-mode like laptop/tablet/clamshell etc. and the
|
||||
proximity-sensor data available to the embedded controller on
|
||||
given host. The regulatory mode configured on Dynamic SAR
|
||||
driver also influences these values.
|
||||
|
||||
The userspace applications can poll for changes to this file
|
||||
using POLLPRI event on file-descriptor (fd) obtained by opening
|
||||
this sysfs entry. Application can then read this information from
|
||||
the sysfs node and consume the given information.
|
@ -1298,3 +1298,239 @@ Description: This node is used to set or display whether UFS WriteBooster is
|
||||
(if the platform supports UFSHCD_CAP_CLK_SCALING). For a
|
||||
platform that doesn't support UFSHCD_CAP_CLK_SCALING, we can
|
||||
disable/enable WriteBooster through this sysfs node.
|
||||
|
||||
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/hpb_version
|
||||
Date: June 2021
|
||||
Contact: Daejun Park <daejun7.park@samsung.com>
|
||||
Description: This entry shows the HPB specification version.
|
||||
The full information about the descriptor can be found in the UFS
|
||||
HPB (Host Performance Booster) Extension specifications.
|
||||
Example: version 1.2.3 = 0123h
|
||||
|
||||
The file is read only.
|
||||
|
||||
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/hpb_control
|
||||
Date: June 2021
|
||||
Contact: Daejun Park <daejun7.park@samsung.com>
|
||||
Description: This entry shows an indication of the HPB control mode.
|
||||
00h: Host control mode
|
||||
01h: Device control mode
|
||||
|
||||
The file is read only.
|
||||
|
||||
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/hpb_region_size
|
||||
Date: June 2021
|
||||
Contact: Daejun Park <daejun7.park@samsung.com>
|
||||
Description: This entry shows the bHPBRegionSize which can be calculated
|
||||
as in the following (in bytes):
|
||||
HPB Region size = 512B * 2^bHPBRegionSize
|
||||
|
||||
The file is read only.
|
||||
|
||||
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/hpb_number_lu
|
||||
Date: June 2021
|
||||
Contact: Daejun Park <daejun7.park@samsung.com>
|
||||
Description: This entry shows the maximum number of HPB LU supported by
|
||||
the device.
|
||||
00h: HPB is not supported by the device.
|
||||
01h ~ 20h: Maximum number of HPB LU supported by the device
|
||||
|
||||
The file is read only.
|
||||
|
||||
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/hpb_subregion_size
|
||||
Date: June 2021
|
||||
Contact: Daejun Park <daejun7.park@samsung.com>
|
||||
Description: This entry shows the bHPBSubRegionSize, which can be
|
||||
calculated as in the following (in bytes) and shall be a multiple of
|
||||
logical block size:
|
||||
HPB Sub-Region size = 512B x 2^bHPBSubRegionSize
|
||||
bHPBSubRegionSize shall not exceed bHPBRegionSize.
|
||||
|
||||
The file is read only.
|
||||
|
||||
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/hpb_max_active_regions
|
||||
Date: June 2021
|
||||
Contact: Daejun Park <daejun7.park@samsung.com>
|
||||
Description: This entry shows the maximum number of active HPB regions that
|
||||
is supported by the device.
|
||||
|
||||
The file is read only.
|
||||
|
||||
What: /sys/class/scsi_device/*/device/unit_descriptor/hpb_lu_max_active_regions
|
||||
Date: June 2021
|
||||
Contact: Daejun Park <daejun7.park@samsung.com>
|
||||
Description: This entry shows the maximum number of HPB regions assigned to
|
||||
the HPB logical unit.
|
||||
|
||||
The file is read only.
|
||||
|
||||
What: /sys/class/scsi_device/*/device/unit_descriptor/hpb_pinned_region_start_offset
|
||||
Date: June 2021
|
||||
Contact: Daejun Park <daejun7.park@samsung.com>
|
||||
Description: This entry shows the start offset of HPB pinned region.
|
||||
|
||||
The file is read only.
|
||||
|
||||
What: /sys/class/scsi_device/*/device/unit_descriptor/hpb_number_pinned_regions
|
||||
Date: June 2021
|
||||
Contact: Daejun Park <daejun7.park@samsung.com>
|
||||
Description: This entry shows the number of HPB pinned regions assigned to
|
||||
the HPB logical unit.
|
||||
|
||||
The file is read only.
|
||||
|
||||
What: /sys/class/scsi_device/*/device/hpb_stats/hit_cnt
|
||||
Date: June 2021
|
||||
Contact: Daejun Park <daejun7.park@samsung.com>
|
||||
Description: This entry shows the number of reads that changed to HPB read.
|
||||
|
||||
The file is read only.
|
||||
|
||||
What: /sys/class/scsi_device/*/device/hpb_stats/miss_cnt
|
||||
Date: June 2021
|
||||
Contact: Daejun Park <daejun7.park@samsung.com>
|
||||
Description: This entry shows the number of reads that cannot be changed to
|
||||
HPB read.
|
||||
|
||||
The file is read only.
|
||||
|
||||
What: /sys/class/scsi_device/*/device/hpb_stats/rb_noti_cnt
|
||||
Date: June 2021
|
||||
Contact: Daejun Park <daejun7.park@samsung.com>
|
||||
Description: This entry shows the number of response UPIUs that has
|
||||
recommendations for activating sub-regions and/or inactivating region.
|
||||
|
||||
The file is read only.
|
||||
|
||||
What: /sys/class/scsi_device/*/device/hpb_stats/rb_active_cnt
|
||||
Date: June 2021
|
||||
Contact: Daejun Park <daejun7.park@samsung.com>
|
||||
Description: This entry shows the number of active sub-regions recommended by
|
||||
response UPIUs.
|
||||
|
||||
The file is read only.
|
||||
|
||||
What: /sys/class/scsi_device/*/device/hpb_stats/rb_inactive_cnt
|
||||
Date: June 2021
|
||||
Contact: Daejun Park <daejun7.park@samsung.com>
|
||||
Description: This entry shows the number of inactive regions recommended by
|
||||
response UPIUs.
|
||||
|
||||
The file is read only.
|
||||
|
||||
What: /sys/class/scsi_device/*/device/hpb_stats/map_req_cnt
|
||||
Date: June 2021
|
||||
Contact: Daejun Park <daejun7.park@samsung.com>
|
||||
Description: This entry shows the number of read buffer commands for
|
||||
activating sub-regions recommended by response UPIUs.
|
||||
|
||||
The file is read only.
|
||||
|
||||
What: /sys/class/scsi_device/*/device/hpb_params/requeue_timeout_ms
|
||||
Date: June 2021
|
||||
Contact: Daejun Park <daejun7.park@samsung.com>
|
||||
Description: This entry shows the requeue timeout threshold for write buffer
|
||||
command in ms. The value can be changed by writing an integer to
|
||||
this entry.
|
||||
|
||||
What: /sys/bus/platform/drivers/ufshcd/*/attributes/max_data_size_hpb_single_cmd
|
||||
Date: June 2021
|
||||
Contact: Daejun Park <daejun7.park@samsung.com>
|
||||
Description: This entry shows the maximum HPB data size for using a single HPB
|
||||
command.
|
||||
|
||||
=== ========
|
||||
00h 4KB
|
||||
01h 8KB
|
||||
02h 12KB
|
||||
...
|
||||
FFh 1024KB
|
||||
=== ========
|
||||
|
||||
The file is read only.
|
||||
|
||||
What: /sys/bus/platform/drivers/ufshcd/*/flags/hpb_enable
|
||||
Date: June 2021
|
||||
Contact: Daejun Park <daejun7.park@samsung.com>
|
||||
Description: This entry shows the status of HPB.
|
||||
|
||||
== ============================
|
||||
0 HPB is not enabled.
|
||||
1 HPB is enabled
|
||||
== ============================
|
||||
|
||||
The file is read only.
|
||||
|
||||
What: /sys/class/scsi_device/*/device/hpb_param_sysfs/activation_thld
|
||||
Date: February 2021
|
||||
Contact: Avri Altman <avri.altman@wdc.com>
|
||||
Description: In host control mode, reads are the major source of activation
|
||||
trials. Once this threshold hs met, the region is added to the
|
||||
"to-be-activated" list. Since we reset the read counter upon
|
||||
write, this include sending a rb command updating the region
|
||||
ppn as well.
|
||||
|
||||
What: /sys/class/scsi_device/*/device/hpb_param_sysfs/normalization_factor
|
||||
Date: February 2021
|
||||
Contact: Avri Altman <avri.altman@wdc.com>
|
||||
Description: In host control mode, we think of the regions as "buckets".
|
||||
Those buckets are being filled with reads, and emptied on write.
|
||||
We use entries_per_srgn - the amount of blocks in a subregion as
|
||||
our bucket size. This applies because HPB1.0 only handles
|
||||
single-block reads. Once the bucket size is crossed, we trigger
|
||||
a normalization work - not only to avoid overflow, but mainly
|
||||
because we want to keep those counters normalized, as we are
|
||||
using those reads as a comparative score, to make various decisions.
|
||||
The normalization is dividing (shift right) the read counter by
|
||||
the normalization_factor. If during consecutive normalizations
|
||||
an active region has exhausted its reads - inactivate it.
|
||||
|
||||
What: /sys/class/scsi_device/*/device/hpb_param_sysfs/eviction_thld_enter
|
||||
Date: February 2021
|
||||
Contact: Avri Altman <avri.altman@wdc.com>
|
||||
Description: Region deactivation is often due to the fact that eviction took
|
||||
place: A region becomes active at the expense of another. This is
|
||||
happening when the max-active-regions limit has been crossed.
|
||||
In host mode, eviction is considered an extreme measure. We
|
||||
want to verify that the entering region has enough reads, and
|
||||
the exiting region has much fewer reads. eviction_thld_enter is
|
||||
the min reads that a region must have in order to be considered
|
||||
a candidate for evicting another region.
|
||||
|
||||
What: /sys/class/scsi_device/*/device/hpb_param_sysfs/eviction_thld_exit
|
||||
Date: February 2021
|
||||
Contact: Avri Altman <avri.altman@wdc.com>
|
||||
Description: Same as above for the exiting region. A region is considered to
|
||||
be a candidate for eviction only if it has fewer reads than
|
||||
eviction_thld_exit.
|
||||
|
||||
What: /sys/class/scsi_device/*/device/hpb_param_sysfs/read_timeout_ms
|
||||
Date: February 2021
|
||||
Contact: Avri Altman <avri.altman@wdc.com>
|
||||
Description: In order not to hang on to "cold" regions, we inactivate
|
||||
a region that has no READ access for a predefined amount of
|
||||
time - read_timeout_ms. If read_timeout_ms has expired, and the
|
||||
region is dirty, it is less likely that we can make any use of
|
||||
HPB reading it so we inactivate it. Still, deactivation has
|
||||
its overhead, and we may still benefit from HPB reading this
|
||||
region if it is clean - see read_timeout_expiries.
|
||||
|
||||
What: /sys/class/scsi_device/*/device/hpb_param_sysfs/read_timeout_expiries
|
||||
Date: February 2021
|
||||
Contact: Avri Altman <avri.altman@wdc.com>
|
||||
Description: If the region read timeout has expired, but the region is clean,
|
||||
just re-wind its timer for another spin. Do that as long as it
|
||||
is clean and did not exhaust its read_timeout_expiries threshold.
|
||||
|
||||
What: /sys/class/scsi_device/*/device/hpb_param_sysfs/timeout_polling_interval_ms
|
||||
Date: February 2021
|
||||
Contact: Avri Altman <avri.altman@wdc.com>
|
||||
Description: The frequency with which the delayed worker that checks the
|
||||
read_timeouts is awakened.
|
||||
|
||||
What: /sys/class/scsi_device/*/device/hpb_param_sysfs/inflight_map_req
|
||||
Date: February 2021
|
||||
Contact: Avri Altman <avri.altman@wdc.com>
|
||||
Description: In host control mode the host is the originator of map requests.
|
||||
To avoid flooding the device with map requests, use a simple throttling
|
||||
mechanism that limits the number of inflight map requests.
|
||||
|
@ -41,8 +41,7 @@ Description: This parameter controls the number of prefree segments to be
|
||||
What: /sys/fs/f2fs/<disk>/main_blkaddr
|
||||
Date: November 2019
|
||||
Contact: "Ramon Pantin" <pantin@google.com>
|
||||
Description:
|
||||
Shows first block address of MAIN area.
|
||||
Description: Shows first block address of MAIN area.
|
||||
|
||||
What: /sys/fs/f2fs/<disk>/ipu_policy
|
||||
Date: November 2013
|
||||
@ -493,3 +492,23 @@ Contact: "Chao Yu" <yuchao0@huawei.com>
|
||||
Description: When ATGC is on, it controls age threshold to bypass GCing young
|
||||
candidates whose age is not beyond the threshold, by default it was
|
||||
initialized as 604800 seconds (equals to 7 days).
|
||||
|
||||
What: /sys/fs/f2fs/<disk>/gc_reclaimed_segments
|
||||
Date: July 2021
|
||||
Contact: "Daeho Jeong" <daehojeong@google.com>
|
||||
Description: Show how many segments have been reclaimed by GC during a specific
|
||||
GC mode (0: GC normal, 1: GC idle CB, 2: GC idle greedy,
|
||||
3: GC idle AT, 4: GC urgent high, 5: GC urgent low)
|
||||
You can re-initialize this value to "0".
|
||||
|
||||
What: /sys/fs/f2fs/<disk>/gc_segment_mode
|
||||
Date: July 2021
|
||||
Contact: "Daeho Jeong" <daehojeong@google.com>
|
||||
Description: You can control for which gc mode the "gc_reclaimed_segments" node shows.
|
||||
Refer to the description of the modes in "gc_reclaimed_segments".
|
||||
|
||||
What: /sys/fs/f2fs/<disk>/seq_file_ra_mul
|
||||
Date: July 2021
|
||||
Contact: "Daeho Jeong" <daehojeong@google.com>
|
||||
Description: You can control the multiplier value of bdi device readahead window size
|
||||
between 2 (default) and 256 for POSIX_FADV_SEQUENTIAL advise option.
|
||||
|
24
Documentation/ABI/testing/sysfs-kernel-dmabuf-buffers
Normal file
24
Documentation/ABI/testing/sysfs-kernel-dmabuf-buffers
Normal file
@ -0,0 +1,24 @@
|
||||
What: /sys/kernel/dmabuf/buffers
|
||||
Date: May 2021
|
||||
KernelVersion: v5.13
|
||||
Contact: Hridya Valsaraju <hridya@google.com>
|
||||
Description: The /sys/kernel/dmabuf/buffers directory contains a
|
||||
snapshot of the internal state of every DMA-BUF.
|
||||
/sys/kernel/dmabuf/buffers/<inode_number> will contain the
|
||||
statistics for the DMA-BUF with the unique inode number
|
||||
<inode_number>
|
||||
Users: kernel memory tuning/debugging tools
|
||||
|
||||
What: /sys/kernel/dmabuf/buffers/<inode_number>/exporter_name
|
||||
Date: May 2021
|
||||
KernelVersion: v5.13
|
||||
Contact: Hridya Valsaraju <hridya@google.com>
|
||||
Description: This file is read-only and contains the name of the exporter of
|
||||
the DMA-BUF.
|
||||
|
||||
What: /sys/kernel/dmabuf/buffers/<inode_number>/size
|
||||
Date: May 2021
|
||||
KernelVersion: v5.13
|
||||
Contact: Hridya Valsaraju <hridya@google.com>
|
||||
Description: This file is read-only and specifies the size of the DMA-BUF in
|
||||
bytes.
|
@ -42,8 +42,12 @@ Description: /sys/kernel/iommu_groups/<grp_id>/type shows the type of default
|
||||
======== ======================================================
|
||||
DMA All the DMA transactions from the device in this group
|
||||
are translated by the iommu.
|
||||
DMA-FQ As above, but using batched invalidation to lazily
|
||||
remove translations after use. This may offer reduced
|
||||
overhead at the cost of reduced memory protection.
|
||||
identity All the DMA transactions from the device in this group
|
||||
are not translated by the iommu.
|
||||
are not translated by the iommu. Maximum performance
|
||||
but zero protection.
|
||||
auto Change to the type the device was booted with.
|
||||
======== ======================================================
|
||||
|
||||
|
24
Documentation/ABI/testing/sysfs-kernel-mm-numa
Normal file
24
Documentation/ABI/testing/sysfs-kernel-mm-numa
Normal file
@ -0,0 +1,24 @@
|
||||
What: /sys/kernel/mm/numa/
|
||||
Date: June 2021
|
||||
Contact: Linux memory management mailing list <linux-mm@kvack.org>
|
||||
Description: Interface for NUMA
|
||||
|
||||
What: /sys/kernel/mm/numa/demotion_enabled
|
||||
Date: June 2021
|
||||
Contact: Linux memory management mailing list <linux-mm@kvack.org>
|
||||
Description: Enable/disable demoting pages during reclaim
|
||||
|
||||
Page migration during reclaim is intended for systems
|
||||
with tiered memory configurations. These systems have
|
||||
multiple types of memory with varied performance
|
||||
characteristics instead of plain NUMA systems where
|
||||
the same kind of memory is found at varied distances.
|
||||
Allowing page migration during reclaim enables these
|
||||
systems to migrate pages from fast tiers to slow tiers
|
||||
when the fast tier is under pressure. This migration
|
||||
is performed before swap. It may move data to a NUMA
|
||||
node that does not fall into the cpuset of the
|
||||
allocating process which might be construed to violate
|
||||
the guarantees of cpusets. This should not be enabled
|
||||
on systems which need strict cpuset location
|
||||
guarantees.
|
@ -1,7 +1,7 @@
|
||||
What: /sys/devices/platform/<platform>/tokens/*
|
||||
Date: November 2017
|
||||
KernelVersion: 4.15
|
||||
Contact: "Mario Limonciello" <mario.limonciello@dell.com>
|
||||
Contact: Dell.Client.Kernel@dell.com
|
||||
Description:
|
||||
A read-only description of Dell platform tokens
|
||||
available on the machine.
|
||||
|
@ -111,3 +111,43 @@ Contact: linux-acpi@vger.kernel.org
|
||||
Description:
|
||||
(RW) The PCH FIVR (Fully Integrated Voltage Regulator) switching frequency in MHz,
|
||||
when FIVR clock is 38.4MHz.
|
||||
|
||||
What: /sys/bus/platform/devices/INTC1045:00/pch_fivr_switch_frequency/fivr_switching_freq_mhz
|
||||
Date: September, 2021
|
||||
KernelVersion: v5.15
|
||||
Contact: linux-acpi@vger.kernel.org
|
||||
Description:
|
||||
(RO) Get the FIVR switching control frequency in MHz.
|
||||
|
||||
What: /sys/bus/platform/devices/INTC1045:00/pch_fivr_switch_frequency/fivr_switching_fault_status
|
||||
Date: September, 2021
|
||||
KernelVersion: v5.15
|
||||
Contact: linux-acpi@vger.kernel.org
|
||||
Description:
|
||||
(RO) Read the FIVR switching frequency control fault status.
|
||||
|
||||
What: /sys/bus/platform/devices/INTC1045:00/pch_fivr_switch_frequency/ssc_clock_info
|
||||
Date: September, 2021
|
||||
KernelVersion: v5.15
|
||||
Contact: linux-acpi@vger.kernel.org
|
||||
Description:
|
||||
(RO) Presents SSC (spread spectrum clock) information for EMI
|
||||
(Electro magnetic interference) control. This is a bit mask.
|
||||
Bits Description
|
||||
[7:0] Sets clock spectrum spread percentage:
|
||||
0x00=0.2% , 0x3F=10%
|
||||
1 LSB = 0.1% increase in spread (for
|
||||
settings 0x01 thru 0x1C)
|
||||
1 LSB = 0.2% increase in spread (for
|
||||
settings 0x1E thru 0x3F)
|
||||
[8] When set to 1, enables spread
|
||||
spectrum clock
|
||||
[9] 0: Triangle mode. FFC frequency
|
||||
walks around the Fcenter in a linear
|
||||
fashion
|
||||
1: Random walk mode. FFC frequency
|
||||
changes randomly within the SSC
|
||||
(Spread spectrum clock) range
|
||||
[10] 0: No white noise. 1: Add white noise
|
||||
to spread waveform
|
||||
[11] When 1, future writes are ignored.
|
||||
|
@ -1,7 +1,7 @@
|
||||
What: /sys/devices/platform/<platform>/force_power
|
||||
Date: September 2017
|
||||
KernelVersion: 4.15
|
||||
Contact: "Mario Limonciello" <mario.limonciello@dell.com>
|
||||
Contact: "Mario Limonciello" <mario.limonciello@outlook.com>
|
||||
Description:
|
||||
Modify the platform force power state, influencing
|
||||
Thunderbolt controllers to turn on or off when no
|
||||
|
@ -26,3 +26,10 @@ Contact: Hans de Goede <hdegoede@redhat.com>
|
||||
Description: Reading this file gives the current selected profile for this
|
||||
device. Writing this file with one of the strings from
|
||||
platform_profile_choices changes the profile to the new value.
|
||||
|
||||
This file can be monitored for changes by polling for POLLPRI,
|
||||
POLLPRI will be signalled on any changes, independent of those
|
||||
changes coming from a userspace write; or coming from another
|
||||
source such as e.g. a hotkey triggered profile change handled
|
||||
either directly by the embedded-controller or fully handled
|
||||
inside the kernel.
|
||||
|
@ -295,7 +295,7 @@ Description:
|
||||
|
||||
What: /sys/power/resume_offset
|
||||
Date: April 2018
|
||||
Contact: Mario Limonciello <mario.limonciello@dell.com>
|
||||
Contact: Mario Limonciello <mario.limonciello@outlook.com>
|
||||
Description:
|
||||
This file is used for telling the kernel an offset into a disk
|
||||
to use when hibernating the system such as with a swap file.
|
||||
|
@ -33,6 +33,13 @@ Description:
|
||||
frequency adjustment value (a positive integer) in
|
||||
parts per billion.
|
||||
|
||||
What: /sys/class/ptp/ptpN/max_vclocks
|
||||
Date: May 2021
|
||||
Contact: Yangbo Lu <yangbo.lu@nxp.com>
|
||||
Description:
|
||||
This file contains the maximum number of ptp vclocks.
|
||||
Write integer to re-configure it.
|
||||
|
||||
What: /sys/class/ptp/ptpN/n_alarms
|
||||
Date: September 2010
|
||||
Contact: Richard Cochran <richardcochran@gmail.com>
|
||||
@ -61,6 +68,19 @@ Description:
|
||||
This file contains the number of programmable pins
|
||||
offered by the PTP hardware clock.
|
||||
|
||||
What: /sys/class/ptp/ptpN/n_vclocks
|
||||
Date: May 2021
|
||||
Contact: Yangbo Lu <yangbo.lu@nxp.com>
|
||||
Description:
|
||||
This file contains the number of virtual PTP clocks in
|
||||
use. By default, the value is 0 meaning that only the
|
||||
physical clock is in use. Setting the value creates
|
||||
the corresponding number of virtual clocks and causes
|
||||
the physical clock to become free running. Setting the
|
||||
value back to 0 deletes the virtual clocks and
|
||||
switches the physical clock back to normal, adjustable
|
||||
operation.
|
||||
|
||||
What: /sys/class/ptp/ptpN/pins
|
||||
Date: March 2014
|
||||
Contact: Richard Cochran <richardcochran@gmail.com>
|
||||
|
@ -43,6 +43,7 @@ entries corresponding to EPF driver will be created by the EPF core.
|
||||
.. <EPF Driver1>/
|
||||
... <EPF Device 11>/
|
||||
... <EPF Device 21>/
|
||||
... <EPF Device 31>/
|
||||
.. <EPF Driver2>/
|
||||
... <EPF Device 12>/
|
||||
... <EPF Device 22>/
|
||||
@ -68,6 +69,7 @@ created)
|
||||
... subsys_vendor_id
|
||||
... subsys_id
|
||||
... interrupt_pin
|
||||
... <Symlink EPF Device 31>/
|
||||
... primary/
|
||||
... <Symlink EPC Device1>/
|
||||
... secondary/
|
||||
@ -79,6 +81,13 @@ interface should be added in 'primary' directory and symlink of endpoint
|
||||
controller connected to secondary interface should be added in 'secondary'
|
||||
directory.
|
||||
|
||||
The <EPF Device> directory can have a list of symbolic links
|
||||
(<Symlink EPF Device 31>) to other <EPF Device>. These symbolic links should
|
||||
be created by the user to represent the virtual functions that are bound to
|
||||
the physical function. In the above directory structure <EPF Device 11> is a
|
||||
physical function and <EPF Device 31> is a virtual function. An EPF device once
|
||||
it's linked to another EPF device, cannot be linked to a EPC device.
|
||||
|
||||
EPC Device
|
||||
==========
|
||||
|
||||
@ -98,7 +107,8 @@ entries corresponding to EPC device will be created by the EPC core.
|
||||
|
||||
The <EPC Device> directory will have a list of symbolic links to
|
||||
<EPF Device>. These symbolic links should be created by the user to
|
||||
represent the functions present in the endpoint device.
|
||||
represent the functions present in the endpoint device. Only <EPF Device>
|
||||
that represents a physical function can be linked to a EPC device.
|
||||
|
||||
The <EPC Device> directory will also have a *start* field. Once
|
||||
"1" is written to this field, the endpoint device will be ready to
|
||||
|
@ -103,6 +103,7 @@ need pass only as many optional fields as necessary:
|
||||
- subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF)
|
||||
- class and classmask fields default to 0
|
||||
- driver_data defaults to 0UL.
|
||||
- override_only field defaults to 0.
|
||||
|
||||
Note that driver_data must match the value used by any of the pci_device_id
|
||||
entries defined in the driver. This makes the driver_data field mandatory
|
||||
|
@ -112,6 +112,35 @@ on PowerPC.
|
||||
The ``smp_mb__after_unlock_lock()`` invocations prevent this
|
||||
``WARN_ON()`` from triggering.
|
||||
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Quick Quiz**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| But the chain of rcu_node-structure lock acquisitions guarantees |
|
||||
| that new readers will see all of the updater's pre-grace-period |
|
||||
| accesses and also guarantees that the updater's post-grace-period |
|
||||
| accesses will see all of the old reader's accesses. So why do we |
|
||||
| need all of those calls to smp_mb__after_unlock_lock()? |
|
||||
+-----------------------------------------------------------------------+
|
||||
| **Answer**: |
|
||||
+-----------------------------------------------------------------------+
|
||||
| Because we must provide ordering for RCU's polling grace-period |
|
||||
| primitives, for example, get_state_synchronize_rcu() and |
|
||||
| poll_state_synchronize_rcu(). Consider this code:: |
|
||||
| |
|
||||
| CPU 0 CPU 1 |
|
||||
| ---- ---- |
|
||||
| WRITE_ONCE(X, 1) WRITE_ONCE(Y, 1) |
|
||||
| g = get_state_synchronize_rcu() smp_mb() |
|
||||
| while (!poll_state_synchronize_rcu(g)) r1 = READ_ONCE(X) |
|
||||
| continue; |
|
||||
| r0 = READ_ONCE(Y) |
|
||||
| |
|
||||
| RCU guarantees that the outcome r0 == 0 && r1 == 0 will not |
|
||||
| happen, even if CPU 1 is in an RCU extended quiescent state |
|
||||
| (idle or offline) and thus won't interact directly with the RCU |
|
||||
| core processing at all. |
|
||||
+-----------------------------------------------------------------------+
|
||||
|
||||
This approach must be extended to include idle CPUs, which need
|
||||
RCU's grace-period memory ordering guarantee to extend to any
|
||||
RCU read-side critical sections preceding and following the current
|
||||
|
@ -362,9 +362,8 @@ do_something_gp() uses rcu_dereference() to fetch from ``gp``:
|
||||
12 }
|
||||
|
||||
The rcu_dereference() uses volatile casts and (for DEC Alpha) memory
|
||||
barriers in the Linux kernel. Should a `high-quality implementation of
|
||||
C11 ``memory_order_consume``
|
||||
[PDF] <http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf>`__
|
||||
barriers in the Linux kernel. Should a |high-quality implementation of
|
||||
C11 memory_order_consume [PDF]|_
|
||||
ever appear, then rcu_dereference() could be implemented as a
|
||||
``memory_order_consume`` load. Regardless of the exact implementation, a
|
||||
pointer fetched by rcu_dereference() may not be used outside of the
|
||||
@ -374,6 +373,9 @@ element has been passed from RCU to some other synchronization
|
||||
mechanism, most commonly locking or `reference
|
||||
counting <https://www.kernel.org/doc/Documentation/RCU/rcuref.txt>`__.
|
||||
|
||||
.. |high-quality implementation of C11 memory_order_consume [PDF]| replace:: high-quality implementation of C11 ``memory_order_consume`` [PDF]
|
||||
.. _high-quality implementation of C11 memory_order_consume [PDF]: http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf
|
||||
|
||||
In short, updaters use rcu_assign_pointer() and readers use
|
||||
rcu_dereference(), and these two RCU API elements work together to
|
||||
ensure that readers have a consistent view of newly added data elements.
|
||||
|
@ -37,7 +37,7 @@ over a rather long period of time, but improvements are always welcome!
|
||||
|
||||
1. Does the update code have proper mutual exclusion?
|
||||
|
||||
RCU does allow -readers- to run (almost) naked, but -writers- must
|
||||
RCU does allow *readers* to run (almost) naked, but *writers* must
|
||||
still use some sort of mutual exclusion, such as:
|
||||
|
||||
a. locking,
|
||||
@ -73,7 +73,7 @@ over a rather long period of time, but improvements are always welcome!
|
||||
critical section is every bit as bad as letting them leak out
|
||||
from under a lock. Unless, of course, you have arranged some
|
||||
other means of protection, such as a lock or a reference count
|
||||
-before- letting them out of the RCU read-side critical section.
|
||||
*before* letting them out of the RCU read-side critical section.
|
||||
|
||||
3. Does the update code tolerate concurrent accesses?
|
||||
|
||||
@ -101,7 +101,7 @@ over a rather long period of time, but improvements are always welcome!
|
||||
c. Make updates appear atomic to readers. For example,
|
||||
pointer updates to properly aligned fields will
|
||||
appear atomic, as will individual atomic primitives.
|
||||
Sequences of operations performed under a lock will -not-
|
||||
Sequences of operations performed under a lock will *not*
|
||||
appear to be atomic to RCU readers, nor will sequences
|
||||
of multiple atomic primitives.
|
||||
|
||||
@ -333,7 +333,7 @@ over a rather long period of time, but improvements are always welcome!
|
||||
for example) may be omitted.
|
||||
|
||||
10. Conversely, if you are in an RCU read-side critical section,
|
||||
and you don't hold the appropriate update-side lock, you -must-
|
||||
and you don't hold the appropriate update-side lock, you *must*
|
||||
use the "_rcu()" variants of the list macros. Failing to do so
|
||||
will break Alpha, cause aggressive compilers to generate bad code,
|
||||
and confuse people trying to read your code.
|
||||
@ -359,12 +359,12 @@ over a rather long period of time, but improvements are always welcome!
|
||||
callback pending, then that RCU callback will execute on some
|
||||
surviving CPU. (If this was not the case, a self-spawning RCU
|
||||
callback would prevent the victim CPU from ever going offline.)
|
||||
Furthermore, CPUs designated by rcu_nocbs= might well -always-
|
||||
Furthermore, CPUs designated by rcu_nocbs= might well *always*
|
||||
have their RCU callbacks executed on some other CPUs, in fact,
|
||||
for some real-time workloads, this is the whole point of using
|
||||
the rcu_nocbs= kernel boot parameter.
|
||||
|
||||
13. Unlike other forms of RCU, it -is- permissible to block in an
|
||||
13. Unlike other forms of RCU, it *is* permissible to block in an
|
||||
SRCU read-side critical section (demarked by srcu_read_lock()
|
||||
and srcu_read_unlock()), hence the "SRCU": "sleepable RCU".
|
||||
Please note that if you don't need to sleep in read-side critical
|
||||
@ -411,16 +411,16 @@ over a rather long period of time, but improvements are always welcome!
|
||||
14. The whole point of call_rcu(), synchronize_rcu(), and friends
|
||||
is to wait until all pre-existing readers have finished before
|
||||
carrying out some otherwise-destructive operation. It is
|
||||
therefore critically important to -first- remove any path
|
||||
therefore critically important to *first* remove any path
|
||||
that readers can follow that could be affected by the
|
||||
destructive operation, and -only- -then- invoke call_rcu(),
|
||||
destructive operation, and *only then* invoke call_rcu(),
|
||||
synchronize_rcu(), or friends.
|
||||
|
||||
Because these primitives only wait for pre-existing readers, it
|
||||
is the caller's responsibility to guarantee that any subsequent
|
||||
readers will execute safely.
|
||||
|
||||
15. The various RCU read-side primitives do -not- necessarily contain
|
||||
15. The various RCU read-side primitives do *not* necessarily contain
|
||||
memory barriers. You should therefore plan for the CPU
|
||||
and the compiler to freely reorder code into and out of RCU
|
||||
read-side critical sections. It is the responsibility of the
|
||||
@ -459,8 +459,8 @@ over a rather long period of time, but improvements are always welcome!
|
||||
pass in a function defined within a loadable module, then it in
|
||||
necessary to wait for all pending callbacks to be invoked after
|
||||
the last invocation and before unloading that module. Note that
|
||||
it is absolutely -not- sufficient to wait for a grace period!
|
||||
The current (say) synchronize_rcu() implementation is -not-
|
||||
it is absolutely *not* sufficient to wait for a grace period!
|
||||
The current (say) synchronize_rcu() implementation is *not*
|
||||
guaranteed to wait for callbacks registered on other CPUs.
|
||||
Or even on the current CPU if that CPU recently went offline
|
||||
and came back online.
|
||||
@ -470,7 +470,7 @@ over a rather long period of time, but improvements are always welcome!
|
||||
- call_rcu() -> rcu_barrier()
|
||||
- call_srcu() -> srcu_barrier()
|
||||
|
||||
However, these barrier functions are absolutely -not- guaranteed
|
||||
However, these barrier functions are absolutely *not* guaranteed
|
||||
to wait for a grace period. In fact, if there are no call_rcu()
|
||||
callbacks waiting anywhere in the system, rcu_barrier() is within
|
||||
its rights to return immediately.
|
||||
|
@ -43,7 +43,7 @@ Follow these rules to keep your RCU code working properly:
|
||||
- Set bits and clear bits down in the must-be-zero low-order
|
||||
bits of that pointer. This clearly means that the pointer
|
||||
must have alignment constraints, for example, this does
|
||||
-not- work in general for char* pointers.
|
||||
*not* work in general for char* pointers.
|
||||
|
||||
- XOR bits to translate pointers, as is done in some
|
||||
classic buddy-allocator algorithms.
|
||||
@ -174,7 +174,7 @@ Follow these rules to keep your RCU code working properly:
|
||||
Please see the "CONTROL DEPENDENCIES" section of
|
||||
Documentation/memory-barriers.txt for more details.
|
||||
|
||||
- The pointers are not equal -and- the compiler does
|
||||
- The pointers are not equal *and* the compiler does
|
||||
not have enough information to deduce the value of the
|
||||
pointer. Note that the volatile cast in rcu_dereference()
|
||||
will normally prevent the compiler from knowing too much.
|
||||
@ -360,7 +360,7 @@ in turn destroying the ordering between this load and the loads of the
|
||||
return values. This can result in "p->b" returning pre-initialization
|
||||
garbage values.
|
||||
|
||||
In short, rcu_dereference() is -not- optional when you are going to
|
||||
In short, rcu_dereference() is *not* optional when you are going to
|
||||
dereference the resulting pointer.
|
||||
|
||||
|
||||
|
@ -32,7 +32,7 @@ warnings:
|
||||
|
||||
- Booting Linux using a console connection that is too slow to
|
||||
keep up with the boot-time console-message rate. For example,
|
||||
a 115Kbaud serial console can be -way- too slow to keep up
|
||||
a 115Kbaud serial console can be *way* too slow to keep up
|
||||
with boot-time message rates, and will frequently result in
|
||||
RCU CPU stall warning messages. Especially if you have added
|
||||
debug printk()s.
|
||||
@ -105,7 +105,7 @@ warnings:
|
||||
leading the realization that the CPU had failed.
|
||||
|
||||
The RCU, RCU-sched, and RCU-tasks implementations have CPU stall warning.
|
||||
Note that SRCU does -not- have CPU stall warnings. Please note that
|
||||
Note that SRCU does *not* have CPU stall warnings. Please note that
|
||||
RCU only detects CPU stalls when there is a grace period in progress.
|
||||
No grace period, no CPU stall warnings.
|
||||
|
||||
@ -145,7 +145,7 @@ CONFIG_RCU_CPU_STALL_TIMEOUT
|
||||
this parameter is checked only at the beginning of a cycle.
|
||||
So if you are 10 seconds into a 40-second stall, setting this
|
||||
sysfs parameter to (say) five will shorten the timeout for the
|
||||
-next- stall, or the following warning for the current stall
|
||||
*next* stall, or the following warning for the current stall
|
||||
(assuming the stall lasts long enough). It will not affect the
|
||||
timing of the next warning for the current stall.
|
||||
|
||||
@ -189,8 +189,8 @@ rcupdate.rcu_task_stall_timeout
|
||||
Interpreting RCU's CPU Stall-Detector "Splats"
|
||||
==============================================
|
||||
|
||||
For non-RCU-tasks flavors of RCU, when a CPU detects that it is stalling,
|
||||
it will print a message similar to the following::
|
||||
For non-RCU-tasks flavors of RCU, when a CPU detects that some other
|
||||
CPU is stalling, it will print a message similar to the following::
|
||||
|
||||
INFO: rcu_sched detected stalls on CPUs/tasks:
|
||||
2-...: (3 GPs behind) idle=06c/0/0 softirq=1453/1455 fqs=0
|
||||
@ -202,8 +202,10 @@ causing stalls, and that the stall was affecting RCU-sched. This message
|
||||
will normally be followed by stack dumps for each CPU. Please note that
|
||||
PREEMPT_RCU builds can be stalled by tasks as well as by CPUs, and that
|
||||
the tasks will be indicated by PID, for example, "P3421". It is even
|
||||
possible for an rcu_state stall to be caused by both CPUs -and- tasks,
|
||||
possible for an rcu_state stall to be caused by both CPUs *and* tasks,
|
||||
in which case the offending CPUs and tasks will all be called out in the list.
|
||||
In some cases, CPUs will detect themselves stalling, which will result
|
||||
in a self-detected stall.
|
||||
|
||||
CPU 2's "(3 GPs behind)" indicates that this CPU has not interacted with
|
||||
the RCU core for the past three grace periods. In contrast, CPU 16's "(0
|
||||
@ -224,7 +226,7 @@ is the number that had executed since boot at the time that this CPU
|
||||
last noted the beginning of a grace period, which might be the current
|
||||
(stalled) grace period, or it might be some earlier grace period (for
|
||||
example, if the CPU might have been in dyntick-idle mode for an extended
|
||||
time period. The number after the "/" is the number that have executed
|
||||
time period). The number after the "/" is the number that have executed
|
||||
since boot until the current time. If this latter number stays constant
|
||||
across repeated stall-warning messages, it is possible that RCU's softirq
|
||||
handlers are no longer able to execute on this CPU. This can happen if
|
||||
@ -283,7 +285,8 @@ If the relevant grace-period kthread has been unable to run prior to
|
||||
the stall warning, as was the case in the "All QSes seen" line above,
|
||||
the following additional line is printed::
|
||||
|
||||
kthread starved for 23807 jiffies! g7075 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1 ->cpu=5
|
||||
rcu_sched kthread starved for 23807 jiffies! g7075 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1 ->cpu=5
|
||||
Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
|
||||
|
||||
Starving the grace-period kthreads of CPU time can of course result
|
||||
in RCU CPU stall warnings even when all CPUs and tasks have passed
|
||||
@ -313,15 +316,21 @@ is the current ``TIMER_SOFTIRQ`` count on cpu 4. If this value does not
|
||||
change on successive RCU CPU stall warnings, there is further reason to
|
||||
suspect a timer problem.
|
||||
|
||||
These messages are usually followed by stack dumps of the CPUs and tasks
|
||||
involved in the stall. These stack traces can help you locate the cause
|
||||
of the stall, keeping in mind that the CPU detecting the stall will have
|
||||
an interrupt frame that is mainly devoted to detecting the stall.
|
||||
|
||||
|
||||
Multiple Warnings From One Stall
|
||||
================================
|
||||
|
||||
If a stall lasts long enough, multiple stall-warning messages will be
|
||||
printed for it. The second and subsequent messages are printed at
|
||||
If a stall lasts long enough, multiple stall-warning messages will
|
||||
be printed for it. The second and subsequent messages are printed at
|
||||
longer intervals, so that the time between (say) the first and second
|
||||
message will be about three times the interval between the beginning
|
||||
of the stall and the first message.
|
||||
of the stall and the first message. It can be helpful to compare the
|
||||
stack dumps for the different messages for the same stalled grace period.
|
||||
|
||||
|
||||
Stall Warnings for Expedited Grace Periods
|
||||
|
@ -259,7 +259,7 @@ Configuring the kernel
|
||||
Compiling the kernel
|
||||
--------------------
|
||||
|
||||
- Make sure you have at least gcc 4.9 available.
|
||||
- Make sure you have at least gcc 5.1 available.
|
||||
For more information, refer to :ref:`Documentation/process/changes.rst <changes>`.
|
||||
|
||||
Please note that you can still run a.out user programs with this kernel.
|
||||
|
@ -30,22 +30,21 @@ following ASL code can be used::
|
||||
{
|
||||
Device (STAC)
|
||||
{
|
||||
Name (_ADR, Zero)
|
||||
Name (_HID, "BMA222E")
|
||||
Name (RBUF, ResourceTemplate ()
|
||||
{
|
||||
I2cSerialBus (0x0018, ControllerInitiated, 0x00061A80,
|
||||
AddressingMode7Bit, "\\_SB.I2C6", 0x00,
|
||||
ResourceConsumer, ,)
|
||||
GpioInt (Edge, ActiveHigh, Exclusive, PullDown, 0x0000,
|
||||
"\\_SB.GPO2", 0x00, ResourceConsumer, , )
|
||||
{ // Pin list
|
||||
0
|
||||
}
|
||||
})
|
||||
|
||||
Method (_CRS, 0, Serialized)
|
||||
{
|
||||
Name (RBUF, ResourceTemplate ()
|
||||
{
|
||||
I2cSerialBus (0x0018, ControllerInitiated, 0x00061A80,
|
||||
AddressingMode7Bit, "\\_SB.I2C6", 0x00,
|
||||
ResourceConsumer, ,)
|
||||
GpioInt (Edge, ActiveHigh, Exclusive, PullDown, 0x0000,
|
||||
"\\_SB.GPO2", 0x00, ResourceConsumer, , )
|
||||
{ // Pin list
|
||||
0
|
||||
}
|
||||
})
|
||||
Return (RBUF)
|
||||
}
|
||||
}
|
||||
@ -75,7 +74,7 @@ This option allows loading of user defined SSDTs from initrd and it is useful
|
||||
when the system does not support EFI or when there is not enough EFI storage.
|
||||
|
||||
It works in a similar way with initrd based ACPI tables override/upgrade: SSDT
|
||||
aml code must be placed in the first, uncompressed, initrd under the
|
||||
AML code must be placed in the first, uncompressed, initrd under the
|
||||
"kernel/firmware/acpi" path. Multiple files can be used and this will translate
|
||||
in loading multiple tables. Only SSDT and OEM tables are allowed. See
|
||||
initrd_table_override.txt for more details.
|
||||
@ -103,12 +102,14 @@ This is the preferred method, when EFI is supported on the platform, because it
|
||||
allows a persistent, OS independent way of storing the user defined SSDTs. There
|
||||
is also work underway to implement EFI support for loading user defined SSDTs
|
||||
and using this method will make it easier to convert to the EFI loading
|
||||
mechanism when that will arrive.
|
||||
mechanism when that will arrive. To enable it, the
|
||||
CONFIG_EFI_CUSTOM_SSDT_OVERLAYS shoyld be chosen to y.
|
||||
|
||||
In order to load SSDTs from an EFI variable the efivar_ssdt kernel command line
|
||||
parameter can be used. The argument for the option is the variable name to
|
||||
use. If there are multiple variables with the same name but with different
|
||||
vendor GUIDs, all of them will be loaded.
|
||||
In order to load SSDTs from an EFI variable the ``"efivar_ssdt=..."`` kernel
|
||||
command line parameter can be used (the name has a limitation of 16 characters).
|
||||
The argument for the option is the variable name to use. If there are multiple
|
||||
variables with the same name but with different vendor GUIDs, all of them will
|
||||
be loaded.
|
||||
|
||||
In order to store the AML code in an EFI variable the efivarfs filesystem can be
|
||||
used. It is enabled and mounted by default in /sys/firmware/efi/efivars in all
|
||||
@ -127,7 +128,7 @@ variable with the content from a given file::
|
||||
|
||||
#!/bin/sh -e
|
||||
|
||||
while ! [ -z "$1" ]; do
|
||||
while [ -n "$1" ]; do
|
||||
case "$1" in
|
||||
"-f") filename="$2"; shift;;
|
||||
"-g") guid="$2"; shift;;
|
||||
@ -167,14 +168,14 @@ variable with the content from a given file::
|
||||
Loading ACPI SSDTs from configfs
|
||||
================================
|
||||
|
||||
This option allows loading of user defined SSDTs from userspace via the configfs
|
||||
This option allows loading of user defined SSDTs from user space via the configfs
|
||||
interface. The CONFIG_ACPI_CONFIGFS option must be select and configfs must be
|
||||
mounted. In the following examples, we assume that configfs has been mounted in
|
||||
/config.
|
||||
/sys/kernel/config.
|
||||
|
||||
New tables can be loading by creating new directories in /config/acpi/table/ and
|
||||
writing the SSDT aml code in the aml attribute::
|
||||
New tables can be loading by creating new directories in /sys/kernel/config/acpi/table
|
||||
and writing the SSDT AML code in the aml attribute::
|
||||
|
||||
cd /config/acpi/table
|
||||
cd /sys/kernel/config/acpi/table
|
||||
mkdir my_ssdt
|
||||
cat ~/ssdt.aml > my_ssdt/aml
|
||||
|
@ -72,3 +72,16 @@ that the `rm() <rm_>`_ tool can be used to delete them. Note that the
|
||||
``binder-control`` device cannot be deleted since this would make the binderfs
|
||||
instance unusable. The ``binder-control`` device will be deleted when the
|
||||
binderfs instance is unmounted and all references to it have been dropped.
|
||||
|
||||
Binder features
|
||||
---------------
|
||||
|
||||
Assuming an instance of binderfs has been mounted at ``/dev/binderfs``, the
|
||||
features supported by the binder driver can be located under
|
||||
``/dev/binderfs/features/``. The presence of individual files can be tested
|
||||
to determine whether a particular feature is supported by the driver.
|
||||
|
||||
Example::
|
||||
|
||||
cat /dev/binderfs/features/oneway_spam_detection
|
||||
1
|
||||
|
@ -178,7 +178,7 @@ update the boot loader and the kernel image itself as long as the boot
|
||||
loader passes the correct initrd file size. If by any chance, the boot
|
||||
loader passes a longer size, the kernel fails to find the bootconfig data.
|
||||
|
||||
To do this operation, Linux kernel provides "bootconfig" command under
|
||||
To do this operation, Linux kernel provides ``bootconfig`` command under
|
||||
tools/bootconfig, which allows admin to apply or delete the config file
|
||||
to/from initrd image. You can build it by the following command::
|
||||
|
||||
@ -196,6 +196,43 @@ To remove the config from the image, you can use -d option as below::
|
||||
Then add "bootconfig" on the normal kernel command line to tell the
|
||||
kernel to look for the bootconfig at the end of the initrd file.
|
||||
|
||||
|
||||
Kernel parameters via Boot Config
|
||||
=================================
|
||||
|
||||
In addition to the kernel command line, the boot config can be used for
|
||||
passing the kernel parameters. All the key-value pairs under ``kernel``
|
||||
key will be passed to kernel cmdline directly. Moreover, the key-value
|
||||
pairs under ``init`` will be passed to init process via the cmdline.
|
||||
The parameters are concatinated with user-given kernel cmdline string
|
||||
as the following order, so that the command line parameter can override
|
||||
bootconfig parameters (this depends on how the subsystem handles parameters
|
||||
but in general, earlier parameter will be overwritten by later one.)::
|
||||
|
||||
[bootconfig params][cmdline params] -- [bootconfig init params][cmdline init params]
|
||||
|
||||
Here is an example of the bootconfig file for kernel/init parameters.::
|
||||
|
||||
kernel {
|
||||
root = 01234567-89ab-cdef-0123-456789abcd
|
||||
}
|
||||
init {
|
||||
splash
|
||||
}
|
||||
|
||||
This will be copied into the kernel cmdline string as the following::
|
||||
|
||||
root="01234567-89ab-cdef-0123-456789abcd" -- splash
|
||||
|
||||
If user gives some other command line like,::
|
||||
|
||||
ro bootconfig -- quiet
|
||||
|
||||
The final kernel cmdline will be the following::
|
||||
|
||||
root="01234567-89ab-cdef-0123-456789abcd" ro bootconfig -- splash quiet
|
||||
|
||||
|
||||
Config File Limitation
|
||||
======================
|
||||
|
||||
|
@ -2056,6 +2056,17 @@ Cpuset Interface Files
|
||||
The value of "cpuset.mems" stays constant until the next update
|
||||
and won't be affected by any memory nodes hotplug events.
|
||||
|
||||
Setting a non-empty value to "cpuset.mems" causes memory of
|
||||
tasks within the cgroup to be migrated to the designated nodes if
|
||||
they are currently using memory outside of the designated nodes.
|
||||
|
||||
There is a cost for this memory migration. The migration
|
||||
may not be complete and some memory pages may be left behind.
|
||||
So it is recommended that "cpuset.mems" should be set properly
|
||||
before spawning new tasks into the cpuset. Even if there is
|
||||
a need to change "cpuset.mems" with active tasks, it shouldn't
|
||||
be done frequently.
|
||||
|
||||
cpuset.mems.effective
|
||||
A read-only multiple values file which exists on all
|
||||
cpuset-enabled cgroups.
|
||||
|
@ -58,9 +58,9 @@ source for the output is in brackets ("[]").
|
||||
[NR_CPUS-1]
|
||||
|
||||
offline: CPUs that are not online because they have been
|
||||
HOTPLUGGED off (see cpu-hotplug.txt) or exceed the limit
|
||||
of CPUs allowed by the kernel configuration (kernel_max
|
||||
above). [~cpu_online_mask + cpus >= NR_CPUS]
|
||||
HOTPLUGGED off or exceed the limit of CPUs allowed by the
|
||||
kernel configuration (kernel_max above).
|
||||
[~cpu_online_mask + cpus >= NR_CPUS]
|
||||
|
||||
online: CPUs that are online and being scheduled [cpu_online_mask]
|
||||
|
||||
@ -96,5 +96,5 @@ online.)::
|
||||
possible: 0-127
|
||||
present: 0-3
|
||||
|
||||
See cpu-hotplug.txt for the possible_cpus=NUM kernel start parameter
|
||||
as well as more information on the various cpumasks.
|
||||
See Documentation/core-api/cpu_hotplug.rst for the possible_cpus=NUM
|
||||
kernel start parameter as well as more information on the various cpumasks.
|
||||
|
715
Documentation/admin-guide/device-mapper/dm-ima.rst
Normal file
715
Documentation/admin-guide/device-mapper/dm-ima.rst
Normal file
@ -0,0 +1,715 @@
|
||||
======
|
||||
dm-ima
|
||||
======
|
||||
|
||||
For a given system, various external services/infrastructure tools
|
||||
(including the attestation service) interact with it - both during the
|
||||
setup and during rest of the system run-time. They share sensitive data
|
||||
and/or execute critical workload on that system. The external services
|
||||
may want to verify the current run-time state of the relevant kernel
|
||||
subsystems before fully trusting the system with business-critical
|
||||
data/workload.
|
||||
|
||||
Device mapper plays a critical role on a given system by providing
|
||||
various important functionalities to the block devices using various
|
||||
target types like crypt, verity, integrity etc. Each of these target
|
||||
types’ functionalities can be configured with various attributes.
|
||||
The attributes chosen to configure these target types can significantly
|
||||
impact the security profile of the block device, and in-turn, of the
|
||||
system itself. For instance, the type of encryption algorithm and the
|
||||
key size determines the strength of encryption for a given block device.
|
||||
|
||||
Therefore, verifying the current state of various block devices as well
|
||||
as their various target attributes is crucial for external services before
|
||||
fully trusting the system with business-critical data/workload.
|
||||
|
||||
IMA kernel subsystem provides the necessary functionality for
|
||||
device mapper to measure the state and configuration of
|
||||
various block devices -
|
||||
|
||||
- by device mapper itself, from within the kernel,
|
||||
- in a tamper resistant way,
|
||||
- and re-measured - triggered on state/configuration change.
|
||||
|
||||
Setting the IMA Policy:
|
||||
=======================
|
||||
For IMA to measure the data on a given system, the IMA policy on the
|
||||
system needs to be updated to have following line, and the system needs
|
||||
to be restarted for the measurements to take effect.
|
||||
|
||||
::
|
||||
|
||||
/etc/ima/ima-policy
|
||||
measure func=CRITICAL_DATA label=device-mapper template=ima-buf
|
||||
|
||||
The measurements will be reflected in the IMA logs, which are located at:
|
||||
|
||||
::
|
||||
|
||||
/sys/kernel/security/integrity/ima/ascii_runtime_measurements
|
||||
/sys/kernel/security/integrity/ima/binary_runtime_measurements
|
||||
|
||||
Then IMA ASCII measurement log has the following format:
|
||||
|
||||
::
|
||||
|
||||
<PCR> <TEMPLATE_DATA_DIGEST> <TEMPLATE_NAME> <TEMPLATE_DATA>
|
||||
|
||||
PCR := Platform Configuration Register, in which the values are registered.
|
||||
This is applicable if TPM chip is in use.
|
||||
|
||||
TEMPLATE_DATA_DIGEST := Template data digest of the IMA record.
|
||||
TEMPLATE_NAME := Template name that registered the integrity value (e.g. ima-buf).
|
||||
|
||||
TEMPLATE_DATA := <ALG> ":" <EVENT_DIGEST> <EVENT_NAME> <EVENT_DATA>
|
||||
It contains data for the specific event to be measured,
|
||||
in a given template data format.
|
||||
|
||||
ALG := Algorithm to compute event digest
|
||||
EVENT_DIGEST := Digest of the event data
|
||||
EVENT_NAME := Description of the event (e.g. 'dm_table_load').
|
||||
EVENT_DATA := The event data to be measured.
|
||||
|
||||
|
|
||||
|
||||
| *NOTE #1:*
|
||||
| The DM target data measured by IMA subsystem can alternatively
|
||||
be queried from userspace by setting DM_IMA_MEASUREMENT_FLAG with
|
||||
DM_TABLE_STATUS_CMD.
|
||||
|
||||
|
|
||||
|
||||
| *NOTE #2:*
|
||||
| The Kernel configuration CONFIG_IMA_DISABLE_HTABLE allows measurement of duplicate records.
|
||||
| To support recording duplicate IMA events in the IMA log, the Kernel needs to be configured with
|
||||
CONFIG_IMA_DISABLE_HTABLE=y.
|
||||
|
||||
Supported Device States:
|
||||
========================
|
||||
Following device state changes will trigger IMA measurements:
|
||||
|
||||
1. Table load
|
||||
#. Device resume
|
||||
#. Device remove
|
||||
#. Table clear
|
||||
#. Device rename
|
||||
|
||||
1. Table load:
|
||||
---------------
|
||||
When a new table is loaded in a device's inactive table slot,
|
||||
the device information and target specific details from the
|
||||
targets in the table are measured.
|
||||
|
||||
The IMA measurement log has the following format for 'dm_table_load':
|
||||
|
||||
::
|
||||
|
||||
EVENT_NAME := "dm_table_load"
|
||||
EVENT_DATA := <dm_version_str> ";" <device_metadata> ";" <table_load_data>
|
||||
|
||||
dm_version_str := "dm_version=" <N> "." <N> "." <N>
|
||||
Same as Device Mapper driver version.
|
||||
device_metadata := <device_name> "," <device_uuid> "," <device_major> "," <device_minor> ","
|
||||
<minor_count> "," <num_device_targets> ";"
|
||||
|
||||
device_name := "name=" <dm-device-name>
|
||||
device_uuid := "uuid=" <dm-device-uuid>
|
||||
device_major := "major=" <N>
|
||||
device_minor := "minor=" <N>
|
||||
minor_count := "minor_count=" <N>
|
||||
num_device_targets := "num_targets=" <N>
|
||||
dm-device-name := Name of the device. If it contains special characters like '\', ',', ';',
|
||||
they are prefixed with '\'.
|
||||
dm-device-uuid := UUID of the device. If it contains special characters like '\', ',', ';',
|
||||
they are prefixed with '\'.
|
||||
|
||||
table_load_data := <target_data>
|
||||
Represents the data (as name=value pairs) from various targets in the table,
|
||||
which is being loaded into the DM device's inactive table slot.
|
||||
target_data := <target_data_row> | <target_data><target_data_row>
|
||||
|
||||
target_data_row := <target_index> "," <target_begin> "," <target_len> "," <target_name> ","
|
||||
<target_version> "," <target_attributes> ";"
|
||||
target_index := "target_index=" <N>
|
||||
Represents nth target in the table (from 0 to N-1 targets specified in <num_device_targets>)
|
||||
If all the data for N targets doesn't fit in the given buffer - then the data that fits
|
||||
in the buffer (say from target 0 to x) is measured in a given IMA event.
|
||||
The remaining data from targets x+1 to N-1 is measured in the subsequent IMA events,
|
||||
with the same format as that of 'dm_table_load'
|
||||
i.e. <dm_version_str> ";" <device_metadata> ";" <table_load_data>.
|
||||
|
||||
target_begin := "target_begin=" <N>
|
||||
target_len := "target_len=" <N>
|
||||
target_name := Name of the target. 'linear', 'crypt', 'integrity' etc.
|
||||
The targets that are supported for IMA measurements are documented below in the
|
||||
'Supported targets' section.
|
||||
target_version := "target_version=" <N> "." <N> "." <N>
|
||||
target_attributes := Data containing comma separated list of name=value pairs of target specific attributes.
|
||||
|
||||
For instance, if a linear device is created with the following table entries,
|
||||
# dmsetup create linear1
|
||||
0 2 linear /dev/loop0 512
|
||||
2 2 linear /dev/loop0 512
|
||||
4 2 linear /dev/loop0 512
|
||||
6 2 linear /dev/loop0 512
|
||||
|
||||
Then IMA ASCII measurement log will have the following entry:
|
||||
(converted from ASCII to text for readability)
|
||||
|
||||
10 a8c5ff755561c7a28146389d1514c318592af49a ima-buf sha256:4d73481ecce5eadba8ab084640d85bb9ca899af4d0a122989252a76efadc5b72
|
||||
dm_table_load
|
||||
dm_version=4.45.0;
|
||||
name=linear1,uuid=,major=253,minor=0,minor_count=1,num_targets=4;
|
||||
target_index=0,target_begin=0,target_len=2,target_name=linear,target_version=1.4.0,device_name=7:0,start=512;
|
||||
target_index=1,target_begin=2,target_len=2,target_name=linear,target_version=1.4.0,device_name=7:0,start=512;
|
||||
target_index=2,target_begin=4,target_len=2,target_name=linear,target_version=1.4.0,device_name=7:0,start=512;
|
||||
target_index=3,target_begin=6,target_len=2,target_name=linear,target_version=1.4.0,device_name=7:0,start=512;
|
||||
|
||||
2. Device resume:
|
||||
------------------
|
||||
When a suspended device is resumed, the device information and the hash of the
|
||||
data from previous load of an active table are measured.
|
||||
|
||||
The IMA measurement log has the following format for 'dm_device_resume':
|
||||
|
||||
::
|
||||
|
||||
EVENT_NAME := "dm_device_resume"
|
||||
EVENT_DATA := <dm_version_str> ";" <device_metadata> ";" <active_table_hash> ";" <current_device_capacity> ";"
|
||||
|
||||
dm_version_str := As described in the 'Table load' section above.
|
||||
device_metadata := As described in the 'Table load' section above.
|
||||
active_table_hash := "active_table_hash=" <table_hash_alg> ":" <table_hash>
|
||||
Rerpresents the hash of the IMA data being measured for the
|
||||
active table for the device.
|
||||
table_hash_alg := Algorithm used to compute the hash.
|
||||
table_hash := Hash of the (<dm_version_str> ";" <device_metadata> ";" <table_load_data> ";")
|
||||
as described in the 'dm_table_load' above.
|
||||
Note: If the table_load data spans across multiple IMA 'dm_table_load'
|
||||
events for a given device, the hash is computed combining all the event data
|
||||
i.e. (<dm_version_str> ";" <device_metadata> ";" <table_load_data> ";")
|
||||
across all those events.
|
||||
current_device_capacity := "current_device_capacity=" <N>
|
||||
|
||||
For instance, if a linear device is resumed with the following command,
|
||||
#dmsetup resume linear1
|
||||
|
||||
then IMA ASCII measurement log will have an entry with:
|
||||
(converted from ASCII to text for readability)
|
||||
|
||||
10 56c00cc062ffc24ccd9ac2d67d194af3282b934e ima-buf sha256:e7d12c03b958b4e0e53e7363a06376be88d98a1ac191fdbd3baf5e4b77f329b6
|
||||
dm_device_resume
|
||||
dm_version=4.45.0;
|
||||
name=linear1,uuid=,major=253,minor=0,minor_count=1,num_targets=4;
|
||||
active_table_hash=sha256:4d73481ecce5eadba8ab084640d85bb9ca899af4d0a122989252a76efadc5b72;current_device_capacity=8;
|
||||
|
||||
3. Device remove:
|
||||
------------------
|
||||
When a device is removed, the device information and a sha256 hash of the
|
||||
data from an active and inactive table are measured.
|
||||
|
||||
The IMA measurement log has the following format for 'dm_device_remove':
|
||||
|
||||
::
|
||||
|
||||
EVENT_NAME := "dm_device_remove"
|
||||
EVENT_DATA := <dm_version_str> ";" <device_active_metadata> ";" <device_inactive_metadata> ";"
|
||||
<active_table_hash> "," <inactive_table_hash> "," <remove_all> ";" <current_device_capacity> ";"
|
||||
|
||||
dm_version_str := As described in the 'Table load' section above.
|
||||
device_active_metadata := Device metadata that reflects the currently loaded active table.
|
||||
The format is same as 'device_metadata' described in the 'Table load' section above.
|
||||
device_inactive_metadata := Device metadata that reflects the inactive table.
|
||||
The format is same as 'device_metadata' described in the 'Table load' section above.
|
||||
active_table_hash := Hash of the currently loaded active table.
|
||||
The format is same as 'active_table_hash' described in the 'Device resume' section above.
|
||||
inactive_table_hash := Hash of the inactive table.
|
||||
The format is same as 'active_table_hash' described in the 'Device resume' section above.
|
||||
remove_all := "remove_all=" <yes_no>
|
||||
yes_no := "y" | "n"
|
||||
current_device_capacity := "current_device_capacity=" <N>
|
||||
|
||||
For instance, if a linear device is removed with the following command,
|
||||
#dmsetup remove l1
|
||||
|
||||
then IMA ASCII measurement log will have the following entry:
|
||||
(converted from ASCII to text for readability)
|
||||
|
||||
10 790e830a3a7a31590824ac0642b3b31c2d0e8b38 ima-buf sha256:ab9f3c959367a8f5d4403d6ce9c3627dadfa8f9f0e7ec7899299782388de3840
|
||||
dm_device_remove
|
||||
dm_version=4.45.0;
|
||||
device_active_metadata=name=l1,uuid=,major=253,minor=2,minor_count=1,num_targets=2;
|
||||
device_inactive_metadata=name=l1,uuid=,major=253,minor=2,minor_count=1,num_targets=1;
|
||||
active_table_hash=sha256:4a7e62efaebfc86af755831998b7db6f59b60d23c9534fb16a4455907957953a,
|
||||
inactive_table_hash=sha256:9d79c175bc2302d55a183e8f50ad4bafd60f7692fd6249e5fd213e2464384b86,remove_all=n;
|
||||
current_device_capacity=2048;
|
||||
|
||||
4. Table clear:
|
||||
----------------
|
||||
When an inactive table is cleared from the device, the device information and a sha256 hash of the
|
||||
data from an inactive table are measured.
|
||||
|
||||
The IMA measurement log has the following format for 'dm_table_clear':
|
||||
|
||||
::
|
||||
|
||||
EVENT_NAME := "dm_table_clear"
|
||||
EVENT_DATA := <dm_version_str> ";" <device_inactive_metadata> ";" <inactive_table_hash> ";" <current_device_capacity> ";"
|
||||
|
||||
dm_version_str := As described in the 'Table load' section above.
|
||||
device_inactive_metadata := Device metadata that was captured during the load time inactive table being cleared.
|
||||
The format is same as 'device_metadata' described in the 'Table load' section above.
|
||||
inactive_table_hash := Hash of the inactive table being cleared from the device.
|
||||
The format is same as 'active_table_hash' described in the 'Device resume' section above.
|
||||
current_device_capacity := "current_device_capacity=" <N>
|
||||
|
||||
For instance, if a linear device's inactive table is cleared,
|
||||
#dmsetup clear l1
|
||||
|
||||
then IMA ASCII measurement log will have an entry with:
|
||||
(converted from ASCII to text for readability)
|
||||
|
||||
10 77d347408f557f68f0041acb0072946bb2367fe5 ima-buf sha256:42f9ca22163fdfa548e6229dece2959bc5ce295c681644240035827ada0e1db5
|
||||
dm_table_clear
|
||||
dm_version=4.45.0;
|
||||
name=l1,uuid=,major=253,minor=2,minor_count=1,num_targets=1;
|
||||
inactive_table_hash=sha256:75c0dc347063bf474d28a9907037eba060bfe39d8847fc0646d75e149045d545;current_device_capacity=1024;
|
||||
|
||||
5. Device rename:
|
||||
------------------
|
||||
When an device's NAME or UUID is changed, the device information and the new NAME and UUID
|
||||
are measured.
|
||||
|
||||
The IMA measurement log has the following format for 'dm_device_rename':
|
||||
|
||||
::
|
||||
|
||||
EVENT_NAME := "dm_device_rename"
|
||||
EVENT_DATA := <dm_version_str> ";" <device_active_metadata> ";" <new_device_name> "," <new_device_uuid> ";" <current_device_capacity> ";"
|
||||
|
||||
dm_version_str := As described in the 'Table load' section above.
|
||||
device_active_metadata := Device metadata that reflects the currently loaded active table.
|
||||
The format is same as 'device_metadata' described in the 'Table load' section above.
|
||||
new_device_name := "new_name=" <dm-device-name>
|
||||
dm-device-name := Same as <dm-device-name> described in 'Table load' section above
|
||||
new_device_uuid := "new_uuid=" <dm-device-uuid>
|
||||
dm-device-uuid := Same as <dm-device-uuid> described in 'Table load' section above
|
||||
current_device_capacity := "current_device_capacity=" <N>
|
||||
|
||||
E.g 1: if a linear device's name is changed with the following command,
|
||||
#dmsetup rename linear1 --setuuid 1234-5678
|
||||
|
||||
then IMA ASCII measurement log will have an entry with:
|
||||
(converted from ASCII to text for readability)
|
||||
|
||||
10 8b0423209b4c66ac1523f4c9848c9b51ee332f48 ima-buf sha256:6847b7258134189531db593e9230b257c84f04038b5a18fd2e1473860e0569ac
|
||||
dm_device_rename
|
||||
dm_version=4.45.0;
|
||||
name=linear1,uuid=,major=253,minor=2,minor_count=1,num_targets=1;new_name=linear1,new_uuid=1234-5678;
|
||||
current_device_capacity=1024;
|
||||
|
||||
E.g 2: if a linear device's name is changed with the following command,
|
||||
# dmsetup rename linear1 linear=2
|
||||
|
||||
then IMA ASCII measurement log will have an entry with:
|
||||
(converted from ASCII to text for readability)
|
||||
|
||||
10 bef70476b99c2bdf7136fae033aa8627da1bf76f ima-buf sha256:8c6f9f53b9ef9dc8f92a2f2cca8910e622543d0f0d37d484870cb16b95111402
|
||||
dm_device_rename
|
||||
dm_version=4.45.0;
|
||||
name=linear1,uuid=1234-5678,major=253,minor=2,minor_count=1,num_targets=1;
|
||||
new_name=linear\=2,new_uuid=1234-5678;
|
||||
current_device_capacity=1024;
|
||||
|
||||
Supported targets:
|
||||
==================
|
||||
|
||||
Following targets are supported to measure their data using IMA:
|
||||
|
||||
1. cache
|
||||
#. crypt
|
||||
#. integrity
|
||||
#. linear
|
||||
#. mirror
|
||||
#. multipath
|
||||
#. raid
|
||||
#. snapshot
|
||||
#. striped
|
||||
#. verity
|
||||
|
||||
1. cache
|
||||
---------
|
||||
The 'target_attributes' (described as part of EVENT_DATA in 'Table load'
|
||||
section above) has the following data format for 'cache' target.
|
||||
|
||||
::
|
||||
|
||||
target_attributes := <target_name> "," <target_version> "," <metadata_mode> "," <cache_metadata_device> ","
|
||||
<cache_device> "," <cache_origin_device> "," <writethrough> "," <writeback> ","
|
||||
<passthrough> "," <no_discard_passdown> ";"
|
||||
|
||||
target_name := "target_name=cache"
|
||||
target_version := "target_version=" <N> "." <N> "." <N>
|
||||
metadata_mode := "metadata_mode=" <cache_metadata_mode>
|
||||
cache_metadata_mode := "fail" | "ro" | "rw"
|
||||
cache_device := "cache_device=" <cache_device_name_string>
|
||||
cache_origin_device := "cache_origin_device=" <cache_origin_device_string>
|
||||
writethrough := "writethrough=" <yes_no>
|
||||
writeback := "writeback=" <yes_no>
|
||||
passthrough := "passthrough=" <yes_no>
|
||||
no_discard_passdown := "no_discard_passdown=" <yes_no>
|
||||
yes_no := "y" | "n"
|
||||
|
||||
E.g.
|
||||
When a 'cache' target is loaded, then IMA ASCII measurement log will have an entry
|
||||
similar to the following, depicting what 'cache' attributes are measured in EVENT_DATA
|
||||
for 'dm_table_load' event.
|
||||
(converted from ASCII to text for readability)
|
||||
|
||||
dm_version=4.45.0;name=cache1,uuid=cache_uuid,major=253,minor=2,minor_count=1,num_targets=1;
|
||||
target_index=0,target_begin=0,target_len=28672,target_name=cache,target_version=2.2.0,metadata_mode=rw,
|
||||
cache_metadata_device=253:4,cache_device=253:3,cache_origin_device=253:5,writethrough=y,writeback=n,
|
||||
passthrough=n,metadata2=y,no_discard_passdown=n;
|
||||
|
||||
|
||||
2. crypt
|
||||
---------
|
||||
The 'target_attributes' (described as part of EVENT_DATA in 'Table load'
|
||||
section above) has the following data format for 'crypt' target.
|
||||
|
||||
::
|
||||
|
||||
target_attributes := <target_name> "," <target_version> "," <allow_discards> "," <same_cpu_crypt> ","
|
||||
<submit_from_crypt_cpus> "," <no_read_workqueue> "," <no_write_workqueue> ","
|
||||
<iv_large_sectors> "," <iv_large_sectors> "," [<integrity_tag_size> ","] [<cipher_auth> ","]
|
||||
[<sector_size> ","] [<cipher_string> ","] <key_size> "," <key_parts> ","
|
||||
<key_extra_size> "," <key_mac_size> ";"
|
||||
|
||||
target_name := "target_name=crypt"
|
||||
target_version := "target_version=" <N> "." <N> "." <N>
|
||||
allow_discards := "allow_discards=" <yes_no>
|
||||
same_cpu_crypt := "same_cpu_crypt=" <yes_no>
|
||||
submit_from_crypt_cpus := "submit_from_crypt_cpus=" <yes_no>
|
||||
no_read_workqueue := "no_read_workqueue=" <yes_no>
|
||||
no_write_workqueue := "no_write_workqueue=" <yes_no>
|
||||
iv_large_sectors := "iv_large_sectors=" <yes_no>
|
||||
integrity_tag_size := "integrity_tag_size=" <N>
|
||||
cipher_auth := "cipher_auth=" <string>
|
||||
sector_size := "sector_size=" <N>
|
||||
cipher_string := "cipher_string="
|
||||
key_size := "key_size=" <N>
|
||||
key_parts := "key_parts=" <N>
|
||||
key_extra_size := "key_extra_size=" <N>
|
||||
key_mac_size := "key_mac_size=" <N>
|
||||
yes_no := "y" | "n"
|
||||
|
||||
E.g.
|
||||
When a 'crypt' target is loaded, then IMA ASCII measurement log will have an entry
|
||||
similar to the following, depicting what 'crypt' attributes are measured in EVENT_DATA
|
||||
for 'dm_table_load' event.
|
||||
(converted from ASCII to text for readability)
|
||||
|
||||
dm_version=4.45.0;
|
||||
name=crypt1,uuid=crypt_uuid1,major=253,minor=0,minor_count=1,num_targets=1;
|
||||
target_index=0,target_begin=0,target_len=1953125,target_name=crypt,target_version=1.23.0,
|
||||
allow_discards=y,same_cpu=n,submit_from_crypt_cpus=n,no_read_workqueue=n,no_write_workqueue=n,
|
||||
iv_large_sectors=n,cipher_string=aes-xts-plain64,key_size=32,key_parts=1,key_extra_size=0,key_mac_size=0;
|
||||
|
||||
3. integrity
|
||||
-------------
|
||||
The 'target_attributes' (described as part of EVENT_DATA in 'Table load'
|
||||
section above) has the following data format for 'integrity' target.
|
||||
|
||||
::
|
||||
|
||||
target_attributes := <target_name> "," <target_version> "," <dev_name> "," <start>
|
||||
<tag_size> "," <mode> "," [<meta_device> ","] [<block_size> ","] <recalculate> ","
|
||||
<allow_discards> "," <fix_padding> "," <fix_hmac> "," <legacy_recalculate> ","
|
||||
<journal_sectors> "," <interleave_sectors> "," <buffer_sectors> ";"
|
||||
|
||||
target_name := "target_name=integrity"
|
||||
target_version := "target_version=" <N> "." <N> "." <N>
|
||||
dev_name := "dev_name=" <device_name_str>
|
||||
start := "start=" <N>
|
||||
tag_size := "tag_size=" <N>
|
||||
mode := "mode=" <integrity_mode_str>
|
||||
integrity_mode_str := "J" | "B" | "D" | "R"
|
||||
meta_device := "meta_device=" <meta_device_str>
|
||||
block_size := "block_size=" <N>
|
||||
recalculate := "recalculate=" <yes_no>
|
||||
allow_discards := "allow_discards=" <yes_no>
|
||||
fix_padding := "fix_padding=" <yes_no>
|
||||
fix_hmac := "fix_hmac=" <yes_no>
|
||||
legacy_recalculate := "legacy_recalculate=" <yes_no>
|
||||
journal_sectors := "journal_sectors=" <N>
|
||||
interleave_sectors := "interleave_sectors=" <N>
|
||||
buffer_sectors := "buffer_sectors=" <N>
|
||||
yes_no := "y" | "n"
|
||||
|
||||
E.g.
|
||||
When a 'integrity' target is loaded, then IMA ASCII measurement log will have an entry
|
||||
similar to the following, depicting what 'integrity' attributes are measured in EVENT_DATA
|
||||
for 'dm_table_load' event.
|
||||
(converted from ASCII to text for readability)
|
||||
|
||||
dm_version=4.45.0;
|
||||
name=integrity1,uuid=,major=253,minor=1,minor_count=1,num_targets=1;
|
||||
target_index=0,target_begin=0,target_len=7856,target_name=integrity,target_version=1.10.0,
|
||||
dev_name=253:0,start=0,tag_size=32,mode=J,recalculate=n,allow_discards=n,fix_padding=n,
|
||||
fix_hmac=n,legacy_recalculate=n,journal_sectors=88,interleave_sectors=32768,buffer_sectors=128;
|
||||
|
||||
|
||||
4. linear
|
||||
----------
|
||||
The 'target_attributes' (described as part of EVENT_DATA in 'Table load'
|
||||
section above) has the following data format for 'linear' target.
|
||||
|
||||
::
|
||||
|
||||
target_attributes := <target_name> "," <target_version> "," <device_name> <,> <start> ";"
|
||||
|
||||
target_name := "target_name=linear"
|
||||
target_version := "target_version=" <N> "." <N> "." <N>
|
||||
device_name := "device_name=" <linear_device_name_str>
|
||||
start := "start=" <N>
|
||||
|
||||
E.g.
|
||||
When a 'linear' target is loaded, then IMA ASCII measurement log will have an entry
|
||||
similar to the following, depicting what 'linear' attributes are measured in EVENT_DATA
|
||||
for 'dm_table_load' event.
|
||||
(converted from ASCII to text for readability)
|
||||
|
||||
dm_version=4.45.0;
|
||||
name=linear1,uuid=linear_uuid1,major=253,minor=2,minor_count=1,num_targets=1;
|
||||
target_index=0,target_begin=0,target_len=28672,target_name=linear,target_version=1.4.0,
|
||||
device_name=253:1,start=2048;
|
||||
|
||||
5. mirror
|
||||
----------
|
||||
The 'target_attributes' (described as part of EVENT_DATA in 'Table load'
|
||||
section above) has the following data format for 'mirror' target.
|
||||
|
||||
::
|
||||
|
||||
target_attributes := <target_name> "," <target_version> "," <nr_mirrors> ","
|
||||
<mirror_device_data> "," <handle_errors> "," <keep_log> "," <log_type_status> ";"
|
||||
|
||||
target_name := "target_name=mirror"
|
||||
target_version := "target_version=" <N> "." <N> "." <N>
|
||||
nr_mirrors := "nr_mirrors=" <NR>
|
||||
mirror_device_data := <mirror_device_row> | <mirror_device_data><mirror_device_row>
|
||||
mirror_device_row is repeated <NR> times - for <NR> described in <nr_mirrors>.
|
||||
mirror_device_row := <mirror_device_name> "," <mirror_device_status>
|
||||
mirror_device_name := "mirror_device_" <X> "=" <mirror_device_name_str>
|
||||
where <X> ranges from 0 to (<NR> -1) - for <NR> described in <nr_mirrors>.
|
||||
mirror_device_status := "mirror_device_" <X> "_status=" <mirror_device_status_char>
|
||||
where <X> ranges from 0 to (<NR> -1) - for <NR> described in <nr_mirrors>.
|
||||
mirror_device_status_char := "A" | "F" | "D" | "S" | "R" | "U"
|
||||
handle_errors := "handle_errors=" <yes_no>
|
||||
keep_log := "keep_log=" <yes_no>
|
||||
log_type_status := "log_type_status=" <log_type_status_str>
|
||||
yes_no := "y" | "n"
|
||||
|
||||
E.g.
|
||||
When a 'mirror' target is loaded, then IMA ASCII measurement log will have an entry
|
||||
similar to the following, depicting what 'mirror' attributes are measured in EVENT_DATA
|
||||
for 'dm_table_load' event.
|
||||
(converted from ASCII to text for readability)
|
||||
|
||||
dm_version=4.45.0;
|
||||
name=mirror1,uuid=mirror_uuid1,major=253,minor=6,minor_count=1,num_targets=1;
|
||||
target_index=0,target_begin=0,target_len=2048,target_name=mirror,target_version=1.14.0,nr_mirrors=2,
|
||||
mirror_device_0=253:4,mirror_device_0_status=A,
|
||||
mirror_device_1=253:5,mirror_device_1_status=A,
|
||||
handle_errors=y,keep_log=n,log_type_status=;
|
||||
|
||||
6. multipath
|
||||
-------------
|
||||
The 'target_attributes' (described as part of EVENT_DATA in 'Table load'
|
||||
section above) has the following data format for 'multipath' target.
|
||||
|
||||
::
|
||||
|
||||
target_attributes := <target_name> "," <target_version> "," <nr_priority_groups>
|
||||
["," <pg_state> "," <priority_groups> "," <priority_group_paths>] ";"
|
||||
|
||||
target_name := "target_name=multipath"
|
||||
target_version := "target_version=" <N> "." <N> "." <N>
|
||||
nr_priority_groups := "nr_priority_groups=" <NPG>
|
||||
priority_groups := <priority_groups_row>|<priority_groups_row><priority_groups>
|
||||
priority_groups_row := "pg_state_" <X> "=" <pg_state_str> "," "nr_pgpaths_" <X> "=" <NPGP> ","
|
||||
"path_selector_name_" <X> "=" <string> "," <priority_group_paths>
|
||||
where <X> ranges from 0 to (<NPG> -1) - for <NPG> described in <nr_priority_groups>.
|
||||
pg_state_str := "E" | "A" | "D"
|
||||
<priority_group_paths> := <priority_group_paths_row> | <priority_group_paths_row><priority_group_paths>
|
||||
priority_group_paths_row := "path_name_" <X> "_" <Y> "=" <string> "," "is_active_" <X> "_" <Y> "=" <is_active_str>
|
||||
"fail_count_" <X> "_" <Y> "=" <N> "," "path_selector_status_" <X> "_" <Y> "=" <path_selector_status_str>
|
||||
where <X> ranges from 0 to (<NPG> -1) - for <NPG> described in <nr_priority_groups>,
|
||||
and <Y> ranges from 0 to (<NPGP> -1) - for <NPGP> described in <priority_groups_row>.
|
||||
is_active_str := "A" | "F"
|
||||
|
||||
E.g.
|
||||
When a 'multipath' target is loaded, then IMA ASCII measurement log will have an entry
|
||||
similar to the following, depicting what 'multipath' attributes are measured in EVENT_DATA
|
||||
for 'dm_table_load' event.
|
||||
(converted from ASCII to text for readability)
|
||||
|
||||
dm_version=4.45.0;
|
||||
name=mp,uuid=,major=253,minor=0,minor_count=1,num_targets=1;
|
||||
target_index=0,target_begin=0,target_len=2097152,target_name=multipath,target_version=1.14.0,nr_priority_groups=2,
|
||||
pg_state_0=E,nr_pgpaths_0=2,path_selector_name_0=queue-length,
|
||||
path_name_0_0=8:16,is_active_0_0=A,fail_count_0_0=0,path_selector_status_0_0=,
|
||||
path_name_0_1=8:32,is_active_0_1=A,fail_count_0_1=0,path_selector_status_0_1=,
|
||||
pg_state_1=E,nr_pgpaths_1=2,path_selector_name_1=queue-length,
|
||||
path_name_1_0=8:48,is_active_1_0=A,fail_count_1_0=0,path_selector_status_1_0=,
|
||||
path_name_1_1=8:64,is_active_1_1=A,fail_count_1_1=0,path_selector_status_1_1=;
|
||||
|
||||
7. raid
|
||||
--------
|
||||
The 'target_attributes' (described as part of EVENT_DATA in 'Table load'
|
||||
section above) has the following data format for 'raid' target.
|
||||
|
||||
::
|
||||
|
||||
target_attributes := <target_name> "," <target_version> "," <raid_type> "," <raid_disks> "," <raid_state>
|
||||
<raid_device_status> ["," journal_dev_mode] ";"
|
||||
|
||||
target_name := "target_name=raid"
|
||||
target_version := "target_version=" <N> "." <N> "." <N>
|
||||
raid_type := "raid_type=" <raid_type_str>
|
||||
raid_disks := "raid_disks=" <NRD>
|
||||
raid_state := "raid_state=" <raid_state_str>
|
||||
raid_state_str := "frozen" | "reshape" |"resync" | "check" | "repair" | "recover" | "idle" |"undef"
|
||||
raid_device_status := <raid_device_status_row> | <raid_device_status_row><raid_device_status>
|
||||
<raid_device_status_row> is repeated <NRD> times - for <NRD> described in <raid_disks>.
|
||||
raid_device_status_row := "raid_device_" <X> "_status=" <raid_device_status_str>
|
||||
where <X> ranges from 0 to (<NRD> -1) - for <NRD> described in <raid_disks>.
|
||||
raid_device_status_str := "A" | "D" | "a" | "-"
|
||||
journal_dev_mode := "journal_dev_mode=" <journal_dev_mode_str>
|
||||
journal_dev_mode_str := "writethrough" | "writeback" | "invalid"
|
||||
|
||||
E.g.
|
||||
When a 'raid' target is loaded, then IMA ASCII measurement log will have an entry
|
||||
similar to the following, depicting what 'raid' attributes are measured in EVENT_DATA
|
||||
for 'dm_table_load' event.
|
||||
(converted from ASCII to text for readability)
|
||||
|
||||
dm_version=4.45.0;
|
||||
name=raid_LV1,uuid=uuid_raid_LV1,major=253,minor=12,minor_count=1,num_targets=1;
|
||||
target_index=0,target_begin=0,target_len=2048,target_name=raid,target_version=1.15.1,
|
||||
raid_type=raid10,raid_disks=4,raid_state=idle,
|
||||
raid_device_0_status=A,
|
||||
raid_device_1_status=A,
|
||||
raid_device_2_status=A,
|
||||
raid_device_3_status=A;
|
||||
|
||||
|
||||
8. snapshot
|
||||
------------
|
||||
The 'target_attributes' (described as part of EVENT_DATA in 'Table load'
|
||||
section above) has the following data format for 'snapshot' target.
|
||||
|
||||
::
|
||||
|
||||
target_attributes := <target_name> "," <target_version> "," <snap_origin_name> ","
|
||||
<snap_cow_name> "," <snap_valid> "," <snap_merge_failed> "," <snapshot_overflowed> ";"
|
||||
|
||||
target_name := "target_name=snapshot"
|
||||
target_version := "target_version=" <N> "." <N> "." <N>
|
||||
snap_origin_name := "snap_origin_name=" <string>
|
||||
snap_cow_name := "snap_cow_name=" <string>
|
||||
snap_valid := "snap_valid=" <yes_no>
|
||||
snap_merge_failed := "snap_merge_failed=" <yes_no>
|
||||
snapshot_overflowed := "snapshot_overflowed=" <yes_no>
|
||||
yes_no := "y" | "n"
|
||||
|
||||
E.g.
|
||||
When a 'snapshot' target is loaded, then IMA ASCII measurement log will have an entry
|
||||
similar to the following, depicting what 'snapshot' attributes are measured in EVENT_DATA
|
||||
for 'dm_table_load' event.
|
||||
(converted from ASCII to text for readability)
|
||||
|
||||
dm_version=4.45.0;
|
||||
name=snap1,uuid=snap_uuid1,major=253,minor=13,minor_count=1,num_targets=1;
|
||||
target_index=0,target_begin=0,target_len=4096,target_name=snapshot,target_version=1.16.0,
|
||||
snap_origin_name=253:11,snap_cow_name=253:12,snap_valid=y,snap_merge_failed=n,snapshot_overflowed=n;
|
||||
|
||||
9. striped
|
||||
-----------
|
||||
The 'target_attributes' (described as part of EVENT_DATA in 'Table load'
|
||||
section above) has the following data format for 'striped' target.
|
||||
|
||||
::
|
||||
|
||||
target_attributes := <target_name> "," <target_version> "," <stripes> "," <chunk_size> ","
|
||||
<stripe_data> ";"
|
||||
|
||||
target_name := "target_name=striped"
|
||||
target_version := "target_version=" <N> "." <N> "." <N>
|
||||
stripes := "stripes=" <NS>
|
||||
chunk_size := "chunk_size=" <N>
|
||||
stripe_data := <stripe_data_row>|<stripe_data><stripe_data_row>
|
||||
stripe_data_row := <stripe_device_name> "," <stripe_physical_start> "," <stripe_status>
|
||||
stripe_device_name := "stripe_" <X> "_device_name=" <stripe_device_name_str>
|
||||
where <X> ranges from 0 to (<NS> -1) - for <NS> described in <stripes>.
|
||||
stripe_physical_start := "stripe_" <X> "_physical_start=" <N>
|
||||
where <X> ranges from 0 to (<NS> -1) - for <NS> described in <stripes>.
|
||||
stripe_status := "stripe_" <X> "_status=" <stripe_status_str>
|
||||
where <X> ranges from 0 to (<NS> -1) - for <NS> described in <stripes>.
|
||||
stripe_status_str := "D" | "A"
|
||||
|
||||
E.g.
|
||||
When a 'striped' target is loaded, then IMA ASCII measurement log will have an entry
|
||||
similar to the following, depicting what 'striped' attributes are measured in EVENT_DATA
|
||||
for 'dm_table_load' event.
|
||||
(converted from ASCII to text for readability)
|
||||
|
||||
dm_version=4.45.0;
|
||||
name=striped1,uuid=striped_uuid1,major=253,minor=5,minor_count=1,num_targets=1;
|
||||
target_index=0,target_begin=0,target_len=640,target_name=striped,target_version=1.6.0,stripes=2,chunk_size=64,
|
||||
stripe_0_device_name=253:0,stripe_0_physical_start=2048,stripe_0_status=A,
|
||||
stripe_1_device_name=253:3,stripe_1_physical_start=2048,stripe_1_status=A;
|
||||
|
||||
10. verity
|
||||
----------
|
||||
The 'target_attributes' (described as part of EVENT_DATA in 'Table load'
|
||||
section above) has the following data format for 'verity' target.
|
||||
|
||||
::
|
||||
|
||||
target_attributes := <target_name> "," <target_version> "," <hash_failed> "," <verity_version> ","
|
||||
<data_device_name> "," <hash_device_name> "," <verity_algorithm> "," <root_digest> ","
|
||||
<salt> "," <ignore_zero_blocks> "," <check_at_most_once> ["," <root_hash_sig_key_desc>]
|
||||
["," <verity_mode>] ";"
|
||||
|
||||
target_name := "target_name=verity"
|
||||
target_version := "target_version=" <N> "." <N> "." <N>
|
||||
hash_failed := "hash_failed=" <hash_failed_str>
|
||||
hash_failed_str := "C" | "V"
|
||||
verity_version := "verity_version=" <verity_version_str>
|
||||
data_device_name := "data_device_name=" <data_device_name_str>
|
||||
hash_device_name := "hash_device_name=" <hash_device_name_str>
|
||||
verity_algorithm := "verity_algorithm=" <verity_algorithm_str>
|
||||
root_digest := "root_digest=" <root_digest_str>
|
||||
salt := "salt=" <salt_str>
|
||||
salt_str := "-" <verity_salt_str>
|
||||
ignore_zero_blocks := "ignore_zero_blocks=" <yes_no>
|
||||
check_at_most_once := "check_at_most_once=" <yes_no>
|
||||
root_hash_sig_key_desc := "root_hash_sig_key_desc="
|
||||
verity_mode := "verity_mode=" <verity_mode_str>
|
||||
verity_mode_str := "ignore_corruption" | "restart_on_corruption" | "panic_on_corruption" | "invalid"
|
||||
yes_no := "y" | "n"
|
||||
|
||||
E.g.
|
||||
When a 'verity' target is loaded, then IMA ASCII measurement log will have an entry
|
||||
similar to the following, depicting what 'verity' attributes are measured in EVENT_DATA
|
||||
for 'dm_table_load' event.
|
||||
(converted from ASCII to text for readability)
|
||||
|
||||
dm_version=4.45.0;
|
||||
name=test-verity,uuid=,major=253,minor=2,minor_count=1,num_targets=1;
|
||||
target_index=0,target_begin=0,target_len=1953120,target_name=verity,target_version=1.8.0,hash_failed=V,
|
||||
verity_version=1,data_device_name=253:1,hash_device_name=253:0,verity_algorithm=sha256,
|
||||
root_digest=29cb87e60ce7b12b443ba6008266f3e41e93e403d7f298f8e3f316b29ff89c5e,
|
||||
salt=e48da609055204e89ae53b655ca2216dd983cf3cb829f34f63a297d106d53e2d,
|
||||
ignore_zero_blocks=n,check_at_most_once=n;
|
@ -13,6 +13,7 @@ Device Mapper
|
||||
dm-dust
|
||||
dm-ebs
|
||||
dm-flakey
|
||||
dm-ima
|
||||
dm-init
|
||||
dm-integrity
|
||||
dm-io
|
||||
|
@ -78,13 +78,23 @@ Status:
|
||||
2. the number of blocks
|
||||
3. the number of free blocks
|
||||
4. the number of blocks under writeback
|
||||
5. the number of read requests
|
||||
6. the number of read requests that hit the cache
|
||||
7. the number of write requests
|
||||
8. the number of write requests that hit uncommitted block
|
||||
9. the number of write requests that hit committed block
|
||||
10. the number of write requests that bypass the cache
|
||||
11. the number of write requests that are allocated in the cache
|
||||
12. the number of write requests that are blocked on the freelist
|
||||
13. the number of flush requests
|
||||
14. the number of discard requests
|
||||
|
||||
Messages:
|
||||
flush
|
||||
flush the cache device. The message returns successfully
|
||||
Flush the cache device. The message returns successfully
|
||||
if the cache device was flushed without an error
|
||||
flush_on_suspend
|
||||
flush the cache device on next suspend. Use this message
|
||||
Flush the cache device on next suspend. Use this message
|
||||
when you are going to remove the cache device. The proper
|
||||
sequence for removing the cache device is:
|
||||
|
||||
@ -98,3 +108,5 @@ Messages:
|
||||
6. the cache device is now inactive and it can be deleted
|
||||
cleaner
|
||||
See above "cleaner" constructor documentation.
|
||||
clear_stats
|
||||
Clear the statistics that are reported on the status line
|
||||
|
@ -2993,10 +2993,10 @@
|
||||
65 = /dev/infiniband/issm1 Second InfiniBand IsSM device
|
||||
...
|
||||
127 = /dev/infiniband/issm63 63rd InfiniBand IsSM device
|
||||
128 = /dev/infiniband/uverbs0 First InfiniBand verbs device
|
||||
129 = /dev/infiniband/uverbs1 Second InfiniBand verbs device
|
||||
192 = /dev/infiniband/uverbs0 First InfiniBand verbs device
|
||||
193 = /dev/infiniband/uverbs1 Second InfiniBand verbs device
|
||||
...
|
||||
159 = /dev/infiniband/uverbs31 31st InfiniBand verbs device
|
||||
223 = /dev/infiniband/uverbs31 31st InfiniBand verbs device
|
||||
|
||||
232 char Biometric Devices
|
||||
0 = /dev/biometric/sensor0/fingerprint first fingerprint sensor on first device
|
||||
|
@ -181,10 +181,12 @@ Open cross-HT issues that core scheduling does not solve
|
||||
--------------------------------------------------------
|
||||
1. For MDS
|
||||
~~~~~~~~~~
|
||||
Core scheduling cannot protect against MDS attacks between an HT running in
|
||||
user mode and another running in kernel mode. Even though both HTs run tasks
|
||||
which trust each other, kernel memory is still considered untrusted. Such
|
||||
attacks are possible for any combination of sibling CPU modes (host or guest mode).
|
||||
Core scheduling cannot protect against MDS attacks between the siblings
|
||||
running in user mode and the others running in kernel mode. Even though all
|
||||
siblings run tasks which trust each other, when the kernel is executing
|
||||
code on behalf of a task, it cannot trust the code running in the
|
||||
sibling. Such attacks are possible for any combination of sibling CPU modes
|
||||
(host or guest mode).
|
||||
|
||||
2. For L1TF
|
||||
~~~~~~~~~~~
|
||||
|
@ -16,3 +16,4 @@ are configurable at compile, boot or run time.
|
||||
multihit.rst
|
||||
special-register-buffer-data-sampling.rst
|
||||
core-scheduling.rst
|
||||
l1d_flush.rst
|
||||
|
69
Documentation/admin-guide/hw-vuln/l1d_flush.rst
Normal file
69
Documentation/admin-guide/hw-vuln/l1d_flush.rst
Normal file
@ -0,0 +1,69 @@
|
||||
L1D Flushing
|
||||
============
|
||||
|
||||
With an increasing number of vulnerabilities being reported around data
|
||||
leaks from the Level 1 Data cache (L1D) the kernel provides an opt-in
|
||||
mechanism to flush the L1D cache on context switch.
|
||||
|
||||
This mechanism can be used to address e.g. CVE-2020-0550. For applications
|
||||
the mechanism keeps them safe from vulnerabilities, related to leaks
|
||||
(snooping of) from the L1D cache.
|
||||
|
||||
|
||||
Related CVEs
|
||||
------------
|
||||
The following CVEs can be addressed by this
|
||||
mechanism
|
||||
|
||||
============= ======================== ==================
|
||||
CVE-2020-0550 Improper Data Forwarding OS related aspects
|
||||
============= ======================== ==================
|
||||
|
||||
Usage Guidelines
|
||||
----------------
|
||||
|
||||
Please see document: :ref:`Documentation/userspace-api/spec_ctrl.rst
|
||||
<set_spec_ctrl>` for details.
|
||||
|
||||
**NOTE**: The feature is disabled by default, applications need to
|
||||
specifically opt into the feature to enable it.
|
||||
|
||||
Mitigation
|
||||
----------
|
||||
|
||||
When PR_SET_L1D_FLUSH is enabled for a task a flush of the L1D cache is
|
||||
performed when the task is scheduled out and the incoming task belongs to a
|
||||
different process and therefore to a different address space.
|
||||
|
||||
If the underlying CPU supports L1D flushing in hardware, the hardware
|
||||
mechanism is used, software fallback for the mitigation, is not supported.
|
||||
|
||||
Mitigation control on the kernel command line
|
||||
---------------------------------------------
|
||||
|
||||
The kernel command line allows to control the L1D flush mitigations at boot
|
||||
time with the option "l1d_flush=". The valid arguments for this option are:
|
||||
|
||||
============ =============================================================
|
||||
on Enables the prctl interface, applications trying to use
|
||||
the prctl() will fail with an error if l1d_flush is not
|
||||
enabled
|
||||
============ =============================================================
|
||||
|
||||
By default the mechanism is disabled.
|
||||
|
||||
Limitations
|
||||
-----------
|
||||
|
||||
The mechanism does not mitigate L1D data leaks between tasks belonging to
|
||||
different processes which are concurrently executing on sibling threads of
|
||||
a physical CPU core when SMT is enabled on the system.
|
||||
|
||||
This can be addressed by controlled placement of processes on physical CPU
|
||||
cores or by disabling SMT. See the relevant chapter in the L1TF mitigation
|
||||
document: :ref:`Documentation/admin-guide/hw-vuln/l1tf.rst <smt_control>`.
|
||||
|
||||
**NOTE** : The opt-in of a task for L1D flushing works only when the task's
|
||||
affinity is limited to cores running in non-SMT mode. If a task which
|
||||
requested L1D flushing is scheduled on a SMT-enabled core the kernel sends
|
||||
a SIGBUS to the task.
|
@ -287,13 +287,21 @@
|
||||
do not want to use tracing_snapshot_alloc() as it needs
|
||||
to be done where GFP_KERNEL allocations are allowed.
|
||||
|
||||
allow_mismatched_32bit_el0 [ARM64]
|
||||
Allow execve() of 32-bit applications and setting of the
|
||||
PER_LINUX32 personality on systems where only a strict
|
||||
subset of the CPUs support 32-bit EL0. When this
|
||||
parameter is present, the set of CPUs supporting 32-bit
|
||||
EL0 is indicated by /sys/devices/system/cpu/aarch32_el0
|
||||
and hot-unplug operations may be restricted.
|
||||
|
||||
See Documentation/arm64/asymmetric-32bit.rst for more
|
||||
information.
|
||||
|
||||
amd_iommu= [HW,X86-64]
|
||||
Pass parameters to the AMD IOMMU driver in the system.
|
||||
Possible values are:
|
||||
fullflush - enable flushing of IO/TLB entries when
|
||||
they are unmapped. Otherwise they are
|
||||
flushed before they will be reused, which
|
||||
is a lot of faster
|
||||
fullflush - Deprecated, equivalent to iommu.strict=1
|
||||
off - do not initialize any AMD IOMMU found in
|
||||
the system
|
||||
force_isolation - Force device isolation for all
|
||||
@ -380,6 +388,9 @@
|
||||
arm64.nopauth [ARM64] Unconditionally disable Pointer Authentication
|
||||
support
|
||||
|
||||
arm64.nomte [ARM64] Unconditionally disable Memory Tagging Extension
|
||||
support
|
||||
|
||||
ataflop= [HW,M68k]
|
||||
|
||||
atarimouse= [HW,MOUSE] Atari Mouse
|
||||
@ -1747,6 +1758,11 @@
|
||||
support for the idxd driver. By default it is set to
|
||||
true (1).
|
||||
|
||||
idxd.tc_override= [HW]
|
||||
Format: <bool>
|
||||
Allow override of default traffic class configuration
|
||||
for the device. By default it is set to false (0).
|
||||
|
||||
ieee754= [MIPS] Select IEEE Std 754 conformance mode
|
||||
Format: { strict | legacy | 2008 | relaxed }
|
||||
Default: strict
|
||||
@ -1944,18 +1960,17 @@
|
||||
this case, gfx device will use physical address for
|
||||
DMA.
|
||||
strict [Default Off]
|
||||
With this option on every unmap_single operation will
|
||||
result in a hardware IOTLB flush operation as opposed
|
||||
to batching them for performance.
|
||||
Deprecated, equivalent to iommu.strict=1.
|
||||
sp_off [Default Off]
|
||||
By default, super page will be supported if Intel IOMMU
|
||||
has the capability. With this option, super page will
|
||||
not be supported.
|
||||
sm_on [Default Off]
|
||||
By default, scalable mode will be disabled even if the
|
||||
hardware advertises that it has support for the scalable
|
||||
mode translation. With this option set, scalable mode
|
||||
will be used on hardware which claims to support it.
|
||||
sm_on
|
||||
Enable the Intel IOMMU scalable mode if the hardware
|
||||
advertises that it has support for the scalable mode
|
||||
translation.
|
||||
sm_off
|
||||
Disallow use of the Intel IOMMU scalable mode.
|
||||
tboot_noforce [Default Off]
|
||||
Do not force the Intel IOMMU enabled under tboot.
|
||||
By default, tboot will force Intel IOMMU on, which
|
||||
@ -2047,13 +2062,12 @@
|
||||
throughput at the cost of reduced device isolation.
|
||||
Will fall back to strict mode if not supported by
|
||||
the relevant IOMMU driver.
|
||||
1 - Strict mode (default).
|
||||
1 - Strict mode.
|
||||
DMA unmap operations invalidate IOMMU hardware TLBs
|
||||
synchronously.
|
||||
Note: on x86, the default behaviour depends on the
|
||||
equivalent driver-specific parameters, but a strict
|
||||
mode explicitly specified by either method takes
|
||||
precedence.
|
||||
unset - Use value of CONFIG_IOMMU_DEFAULT_DMA_{LAZY,STRICT}.
|
||||
Note: on x86, strict mode specified via one of the
|
||||
legacy driver-specific options takes precedence.
|
||||
|
||||
iommu.passthrough=
|
||||
[ARM64, X86] Configure DMA to bypass the IOMMU by default.
|
||||
@ -2421,6 +2435,23 @@
|
||||
feature (tagged TLBs) on capable Intel chips.
|
||||
Default is 1 (enabled)
|
||||
|
||||
l1d_flush= [X86,INTEL]
|
||||
Control mitigation for L1D based snooping vulnerability.
|
||||
|
||||
Certain CPUs are vulnerable to an exploit against CPU
|
||||
internal buffers which can forward information to a
|
||||
disclosure gadget under certain conditions.
|
||||
|
||||
In vulnerable processors, the speculatively
|
||||
forwarded data can be used in a cache side channel
|
||||
attack, to access data to which the attacker does
|
||||
not have direct access.
|
||||
|
||||
This parameter controls the mitigation. The
|
||||
options are:
|
||||
|
||||
on - enable the interface for the mitigation
|
||||
|
||||
l1tf= [X86] Control mitigation of the L1TF vulnerability on
|
||||
affected CPUs
|
||||
|
||||
@ -4167,6 +4198,15 @@
|
||||
Format: <bool> (1/Y/y=enable, 0/N/n=disable)
|
||||
default: disabled
|
||||
|
||||
printk.console_no_auto_verbose=
|
||||
Disable console loglevel raise on oops, panic
|
||||
or lockdep-detected issues (only if lock debug is on).
|
||||
With an exception to setups with low baudrate on
|
||||
serial console, keeping this 0 is a good choice
|
||||
in order to provide more debug information.
|
||||
Format: <bool>
|
||||
default: 0 (auto_verbose is enabled)
|
||||
|
||||
printk.devkmsg={on,off,ratelimit}
|
||||
Control writing to /dev/kmsg.
|
||||
on - unlimited logging to /dev/kmsg from userspace
|
||||
@ -4777,7 +4817,7 @@
|
||||
|
||||
reboot= [KNL]
|
||||
Format (x86 or x86_64):
|
||||
[w[arm] | c[old] | h[ard] | s[oft] | g[pio]] \
|
||||
[w[arm] | c[old] | h[ard] | s[oft] | g[pio]] | d[efault] \
|
||||
[[,]s[mp]#### \
|
||||
[[,]b[ios] | a[cpi] | k[bd] | t[riple] | e[fi] | p[ci]] \
|
||||
[[,]f[orce]
|
||||
@ -4945,8 +4985,6 @@
|
||||
sa1100ir [NET]
|
||||
See drivers/net/irda/sa1100_ir.c.
|
||||
|
||||
sbni= [NET] Granch SBNI12 leased line adapter
|
||||
|
||||
sched_verbose [KNL] Enables verbose scheduler debug messages.
|
||||
|
||||
schedstats= [KNL,X86] Enable or disable scheduled statistics.
|
||||
|
@ -13,10 +13,8 @@ Hotkeys
|
||||
The following FN keys are ignored by the kernel without this driver:
|
||||
|
||||
- FN-F1 (LG control panel) - Generates F15
|
||||
- FN-F5 (Touchpad toggle) - Generates F13
|
||||
- FN-F5 (Touchpad toggle) - Generates F21
|
||||
- FN-F6 (Airplane mode) - Generates RFKILL
|
||||
- FN-F8 (Keyboard backlight) - Generates F16.
|
||||
This key also changes keyboard backlight mode.
|
||||
- FN-F9 (Reader mode) - Generates F14
|
||||
|
||||
The rest of the FN keys work without a need for a special driver.
|
||||
|
15
Documentation/admin-guide/mm/damon/index.rst
Normal file
15
Documentation/admin-guide/mm/damon/index.rst
Normal file
@ -0,0 +1,15 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
========================
|
||||
Monitoring Data Accesses
|
||||
========================
|
||||
|
||||
:doc:`DAMON </vm/damon/index>` allows light-weight data access monitoring.
|
||||
Using DAMON, users can analyze the memory access patterns of their systems and
|
||||
optimize those.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
start
|
||||
usage
|
114
Documentation/admin-guide/mm/damon/start.rst
Normal file
114
Documentation/admin-guide/mm/damon/start.rst
Normal file
@ -0,0 +1,114 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
===============
|
||||
Getting Started
|
||||
===============
|
||||
|
||||
This document briefly describes how you can use DAMON by demonstrating its
|
||||
default user space tool. Please note that this document describes only a part
|
||||
of its features for brevity. Please refer to :doc:`usage` for more details.
|
||||
|
||||
|
||||
TL; DR
|
||||
======
|
||||
|
||||
Follow the commands below to monitor and visualize the memory access pattern of
|
||||
your workload. ::
|
||||
|
||||
# # build the kernel with CONFIG_DAMON_*=y, install it, and reboot
|
||||
# mount -t debugfs none /sys/kernel/debug/
|
||||
# git clone https://github.com/awslabs/damo
|
||||
# ./damo/damo record $(pidof <your workload>)
|
||||
# ./damo/damo report heat --plot_ascii
|
||||
|
||||
The final command draws the access heatmap of ``<your workload>``. The heatmap
|
||||
shows which memory region (x-axis) is accessed when (y-axis) and how frequently
|
||||
(number; the higher the more accesses have been observed). ::
|
||||
|
||||
111111111111111111111111111111111111111111111111111111110000
|
||||
111121111111111111111111111111211111111111111111111111110000
|
||||
000000000000000000000000000000000000000000000000001555552000
|
||||
000000000000000000000000000000000000000000000222223555552000
|
||||
000000000000000000000000000000000000000011111677775000000000
|
||||
000000000000000000000000000000000000000488888000000000000000
|
||||
000000000000000000000000000000000177888400000000000000000000
|
||||
000000000000000000000000000046666522222100000000000000000000
|
||||
000000000000000000000014444344444300000000000000000000000000
|
||||
000000000000000002222245555510000000000000000000000000000000
|
||||
# access_frequency: 0 1 2 3 4 5 6 7 8 9
|
||||
# x-axis: space (140286319947776-140286426374096: 101.496 MiB)
|
||||
# y-axis: time (605442256436361-605479951866441: 37.695430s)
|
||||
# resolution: 60x10 (1.692 MiB and 3.770s for each character)
|
||||
|
||||
|
||||
Prerequisites
|
||||
=============
|
||||
|
||||
Kernel
|
||||
------
|
||||
|
||||
You should first ensure your system is running on a kernel built with
|
||||
``CONFIG_DAMON_*=y``.
|
||||
|
||||
|
||||
User Space Tool
|
||||
---------------
|
||||
|
||||
For the demonstration, we will use the default user space tool for DAMON,
|
||||
called DAMON Operator (DAMO). It is available at
|
||||
https://github.com/awslabs/damo. The examples below assume that ``damo`` is on
|
||||
your ``$PATH``. It's not mandatory, though.
|
||||
|
||||
Because DAMO is using the debugfs interface (refer to :doc:`usage` for the
|
||||
detail) of DAMON, you should ensure debugfs is mounted. Mount it manually as
|
||||
below::
|
||||
|
||||
# mount -t debugfs none /sys/kernel/debug/
|
||||
|
||||
or append the following line to your ``/etc/fstab`` file so that your system
|
||||
can automatically mount debugfs upon booting::
|
||||
|
||||
debugfs /sys/kernel/debug debugfs defaults 0 0
|
||||
|
||||
|
||||
Recording Data Access Patterns
|
||||
==============================
|
||||
|
||||
The commands below record the memory access patterns of a program and save the
|
||||
monitoring results to a file. ::
|
||||
|
||||
$ git clone https://github.com/sjp38/masim
|
||||
$ cd masim; make; ./masim ./configs/zigzag.cfg &
|
||||
$ sudo damo record -o damon.data $(pidof masim)
|
||||
|
||||
The first two lines of the commands download an artificial memory access
|
||||
generator program and run it in the background. The generator will repeatedly
|
||||
access two 100 MiB sized memory regions one by one. You can substitute this
|
||||
with your real workload. The last line asks ``damo`` to record the access
|
||||
pattern in the ``damon.data`` file.
|
||||
|
||||
|
||||
Visualizing Recorded Patterns
|
||||
=============================
|
||||
|
||||
The following three commands visualize the recorded access patterns and save
|
||||
the results as separate image files. ::
|
||||
|
||||
$ damo report heats --heatmap access_pattern_heatmap.png
|
||||
$ damo report wss --range 0 101 1 --plot wss_dist.png
|
||||
$ damo report wss --range 0 101 1 --sortby time --plot wss_chron_change.png
|
||||
|
||||
- ``access_pattern_heatmap.png`` will visualize the data access pattern in a
|
||||
heatmap, showing which memory region (y-axis) got accessed when (x-axis)
|
||||
and how frequently (color).
|
||||
- ``wss_dist.png`` will show the distribution of the working set size.
|
||||
- ``wss_chron_change.png`` will show how the working set size has
|
||||
chronologically changed.
|
||||
|
||||
You can view the visualizations of this example workload at [1]_.
|
||||
Visualizations of other realistic workloads are available at [2]_ [3]_ [4]_.
|
||||
|
||||
.. [1] https://damonitor.github.io/doc/html/v17/admin-guide/mm/damon/start.html#visualizing-recorded-patterns
|
||||
.. [2] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html
|
||||
.. [3] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html
|
||||
.. [4] https://damonitor.github.io/test/result/visual/latest/rec.wss_time.png.html
|
112
Documentation/admin-guide/mm/damon/usage.rst
Normal file
112
Documentation/admin-guide/mm/damon/usage.rst
Normal file
@ -0,0 +1,112 @@
|
||||
.. SPDX-License-Identifier: GPL-2.0
|
||||
|
||||
===============
|
||||
Detailed Usages
|
||||
===============
|
||||
|
||||
DAMON provides below three interfaces for different users.
|
||||
|
||||
- *DAMON user space tool.*
|
||||
This is for privileged people such as system administrators who want a
|
||||
just-working human-friendly interface. Using this, users can use the DAMON’s
|
||||
major features in a human-friendly way. It may not be highly tuned for
|
||||
special cases, though. It supports only virtual address spaces monitoring.
|
||||
- *debugfs interface.*
|
||||
This is for privileged user space programmers who want more optimized use of
|
||||
DAMON. Using this, users can use DAMON’s major features by reading
|
||||
from and writing to special debugfs files. Therefore, you can write and use
|
||||
your personalized DAMON debugfs wrapper programs that reads/writes the
|
||||
debugfs files instead of you. The DAMON user space tool is also a reference
|
||||
implementation of such programs. It supports only virtual address spaces
|
||||
monitoring.
|
||||
- *Kernel Space Programming Interface.*
|
||||
This is for kernel space programmers. Using this, users can utilize every
|
||||
feature of DAMON most flexibly and efficiently by writing kernel space
|
||||
DAMON application programs for you. You can even extend DAMON for various
|
||||
address spaces.
|
||||
|
||||
Nevertheless, you could write your own user space tool using the debugfs
|
||||
interface. A reference implementation is available at
|
||||
https://github.com/awslabs/damo. If you are a kernel programmer, you could
|
||||
refer to :doc:`/vm/damon/api` for the kernel space programming interface. For
|
||||
the reason, this document describes only the debugfs interface
|
||||
|
||||
debugfs Interface
|
||||
=================
|
||||
|
||||
DAMON exports three files, ``attrs``, ``target_ids``, and ``monitor_on`` under
|
||||
its debugfs directory, ``<debugfs>/damon/``.
|
||||
|
||||
|
||||
Attributes
|
||||
----------
|
||||
|
||||
Users can get and set the ``sampling interval``, ``aggregation interval``,
|
||||
``regions update interval``, and min/max number of monitoring target regions by
|
||||
reading from and writing to the ``attrs`` file. To know about the monitoring
|
||||
attributes in detail, please refer to the :doc:`/vm/damon/design`. For
|
||||
example, below commands set those values to 5 ms, 100 ms, 1,000 ms, 10 and
|
||||
1000, and then check it again::
|
||||
|
||||
# cd <debugfs>/damon
|
||||
# echo 5000 100000 1000000 10 1000 > attrs
|
||||
# cat attrs
|
||||
5000 100000 1000000 10 1000
|
||||
|
||||
|
||||
Target IDs
|
||||
----------
|
||||
|
||||
Some types of address spaces supports multiple monitoring target. For example,
|
||||
the virtual memory address spaces monitoring can have multiple processes as the
|
||||
monitoring targets. Users can set the targets by writing relevant id values of
|
||||
the targets to, and get the ids of the current targets by reading from the
|
||||
``target_ids`` file. In case of the virtual address spaces monitoring, the
|
||||
values should be pids of the monitoring target processes. For example, below
|
||||
commands set processes having pids 42 and 4242 as the monitoring targets and
|
||||
check it again::
|
||||
|
||||
# cd <debugfs>/damon
|
||||
# echo 42 4242 > target_ids
|
||||
# cat target_ids
|
||||
42 4242
|
||||
|
||||
Note that setting the target ids doesn't start the monitoring.
|
||||
|
||||
|
||||
Turning On/Off
|
||||
--------------
|
||||
|
||||
Setting the files as described above doesn't incur effect unless you explicitly
|
||||
start the monitoring. You can start, stop, and check the current status of the
|
||||
monitoring by writing to and reading from the ``monitor_on`` file. Writing
|
||||
``on`` to the file starts the monitoring of the targets with the attributes.
|
||||
Writing ``off`` to the file stops those. DAMON also stops if every target
|
||||
process is terminated. Below example commands turn on, off, and check the
|
||||
status of DAMON::
|
||||
|
||||
# cd <debugfs>/damon
|
||||
# echo on > monitor_on
|
||||
# echo off > monitor_on
|
||||
# cat monitor_on
|
||||
off
|
||||
|
||||
Please note that you cannot write to the above-mentioned debugfs files while
|
||||
the monitoring is turned on. If you write to the files while DAMON is running,
|
||||
an error code such as ``-EBUSY`` will be returned.
|
||||
|
||||
|
||||
Tracepoint for Monitoring Results
|
||||
=================================
|
||||
|
||||
DAMON provides the monitoring results via a tracepoint,
|
||||
``damon:damon_aggregated``. While the monitoring is turned on, you could
|
||||
record the tracepoint events and show results using tracepoint supporting tools
|
||||
like ``perf``. For example::
|
||||
|
||||
# echo on > monitor_on
|
||||
# perf record -e damon:damon_aggregated &
|
||||
# sleep 5
|
||||
# kill 9 $(pidof perf)
|
||||
# echo off > monitor_on
|
||||
# perf script
|
@ -27,6 +27,7 @@ the Linux memory management.
|
||||
|
||||
concepts
|
||||
cma_debugfs
|
||||
damon/index
|
||||
hugetlbpage
|
||||
idle_page_tracking
|
||||
ksm
|
||||
|
@ -1,466 +1,576 @@
|
||||
.. _admin_guide_memory_hotplug:
|
||||
|
||||
==============
|
||||
Memory Hotplug
|
||||
==============
|
||||
==================
|
||||
Memory Hot(Un)Plug
|
||||
==================
|
||||
|
||||
:Created: Jul 28 2007
|
||||
:Updated: Add some details about locking internals: Aug 20 2018
|
||||
|
||||
This document is about memory hotplug including how-to-use and current status.
|
||||
Because Memory Hotplug is still under development, contents of this text will
|
||||
be changed often.
|
||||
This document describes generic Linux support for memory hot(un)plug with
|
||||
a focus on System RAM, including ZONE_MOVABLE support.
|
||||
|
||||
.. contents:: :local:
|
||||
|
||||
.. note::
|
||||
|
||||
(1) x86_64's has special implementation for memory hotplug.
|
||||
This text does not describe it.
|
||||
(2) This text assumes that sysfs is mounted at ``/sys``.
|
||||
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
Purpose of memory hotplug
|
||||
-------------------------
|
||||
Memory hot(un)plug allows for increasing and decreasing the size of physical
|
||||
memory available to a machine at runtime. In the simplest case, it consists of
|
||||
physically plugging or unplugging a DIMM at runtime, coordinated with the
|
||||
operating system.
|
||||
|
||||
Memory Hotplug allows users to increase/decrease the amount of memory.
|
||||
Generally, there are two purposes.
|
||||
Memory hot(un)plug is used for various purposes:
|
||||
|
||||
(A) For changing the amount of memory.
|
||||
This is to allow a feature like capacity on demand.
|
||||
(B) For installing/removing DIMMs or NUMA-nodes physically.
|
||||
This is to exchange DIMMs/NUMA-nodes, reduce power consumption, etc.
|
||||
- The physical memory available to a machine can be adjusted at runtime, up- or
|
||||
downgrading the memory capacity. This dynamic memory resizing, sometimes
|
||||
referred to as "capacity on demand", is frequently used with virtual machines
|
||||
and logical partitions.
|
||||
|
||||
(A) is required by highly virtualized environments and (B) is required by
|
||||
hardware which supports memory power management.
|
||||
- Replacing hardware, such as DIMMs or whole NUMA nodes, without downtime. One
|
||||
example is replacing failing memory modules.
|
||||
|
||||
Linux memory hotplug is designed for both purpose.
|
||||
- Reducing energy consumption either by physically unplugging memory modules or
|
||||
by logically unplugging (parts of) memory modules from Linux.
|
||||
|
||||
Phases of memory hotplug
|
||||
------------------------
|
||||
Further, the basic memory hot(un)plug infrastructure in Linux is nowadays also
|
||||
used to expose persistent memory, other performance-differentiated memory and
|
||||
reserved memory regions as ordinary system RAM to Linux.
|
||||
|
||||
There are 2 phases in Memory Hotplug:
|
||||
Linux only supports memory hot(un)plug on selected 64 bit architectures, such as
|
||||
x86_64, arm64, ppc64, s390x and ia64.
|
||||
|
||||
1) Physical Memory Hotplug phase
|
||||
2) Logical Memory Hotplug phase.
|
||||
Memory Hot(Un)Plug Granularity
|
||||
------------------------------
|
||||
|
||||
The First phase is to communicate hardware/firmware and make/erase
|
||||
environment for hotplugged memory. Basically, this phase is necessary
|
||||
for the purpose (B), but this is good phase for communication between
|
||||
highly virtualized environments too.
|
||||
|
||||
When memory is hotplugged, the kernel recognizes new memory, makes new memory
|
||||
management tables, and makes sysfs files for new memory's operation.
|
||||
|
||||
If firmware supports notification of connection of new memory to OS,
|
||||
this phase is triggered automatically. ACPI can notify this event. If not,
|
||||
"probe" operation by system administration is used instead.
|
||||
(see :ref:`memory_hotplug_physical_mem`).
|
||||
|
||||
Logical Memory Hotplug phase is to change memory state into
|
||||
available/unavailable for users. Amount of memory from user's view is
|
||||
changed by this phase. The kernel makes all memory in it as free pages
|
||||
when a memory range is available.
|
||||
|
||||
In this document, this phase is described as online/offline.
|
||||
|
||||
Logical Memory Hotplug phase is triggered by write of sysfs file by system
|
||||
administrator. For the hot-add case, it must be executed after Physical Hotplug
|
||||
phase by hand.
|
||||
(However, if you writes udev's hotplug scripts for memory hotplug, these
|
||||
phases can be execute in seamless way.)
|
||||
|
||||
Unit of Memory online/offline operation
|
||||
---------------------------------------
|
||||
|
||||
Memory hotplug uses SPARSEMEM memory model which allows memory to be divided
|
||||
into chunks of the same size. These chunks are called "sections". The size of
|
||||
a memory section is architecture dependent. For example, power uses 16MiB, ia64
|
||||
uses 1GiB.
|
||||
Memory hot(un)plug in Linux uses the SPARSEMEM memory model, which divides the
|
||||
physical memory address space into chunks of the same size: memory sections. The
|
||||
size of a memory section is architecture dependent. For example, x86_64 uses
|
||||
128 MiB and ppc64 uses 16 MiB.
|
||||
|
||||
Memory sections are combined into chunks referred to as "memory blocks". The
|
||||
size of a memory block is architecture dependent and represents the logical
|
||||
unit upon which memory online/offline operations are to be performed. The
|
||||
default size of a memory block is the same as memory section size unless an
|
||||
architecture specifies otherwise. (see :ref:`memory_hotplug_sysfs_files`.)
|
||||
size of a memory block is architecture dependent and corresponds to the smallest
|
||||
granularity that can be hot(un)plugged. The default size of a memory block is
|
||||
the same as memory section size, unless an architecture specifies otherwise.
|
||||
|
||||
To determine the size (in bytes) of a memory block please read this file::
|
||||
All memory blocks have the same size.
|
||||
|
||||
/sys/devices/system/memory/block_size_bytes
|
||||
Phases of Memory Hotplug
|
||||
------------------------
|
||||
|
||||
Kernel Configuration
|
||||
====================
|
||||
Memory hotplug consists of two phases:
|
||||
|
||||
To use memory hotplug feature, kernel must be compiled with following
|
||||
config options.
|
||||
(1) Adding the memory to Linux
|
||||
(2) Onlining memory blocks
|
||||
|
||||
- For all memory hotplug:
|
||||
- Memory model -> Sparse Memory (``CONFIG_SPARSEMEM``)
|
||||
- Allow for memory hot-add (``CONFIG_MEMORY_HOTPLUG``)
|
||||
In the first phase, metadata, such as the memory map ("memmap") and page tables
|
||||
for the direct mapping, is allocated and initialized, and memory blocks are
|
||||
created; the latter also creates sysfs files for managing newly created memory
|
||||
blocks.
|
||||
|
||||
- To enable memory removal, the following are also necessary:
|
||||
- Allow for memory hot remove (``CONFIG_MEMORY_HOTREMOVE``)
|
||||
- Page Migration (``CONFIG_MIGRATION``)
|
||||
In the second phase, added memory is exposed to the page allocator. After this
|
||||
phase, the memory is visible in memory statistics, such as free and total
|
||||
memory, of the system.
|
||||
|
||||
- For ACPI memory hotplug, the following are also necessary:
|
||||
- Memory hotplug (under ACPI Support menu) (``CONFIG_ACPI_HOTPLUG_MEMORY``)
|
||||
- This option can be kernel module.
|
||||
Phases of Memory Hotunplug
|
||||
--------------------------
|
||||
|
||||
- As a related configuration, if your box has a feature of NUMA-node hotplug
|
||||
via ACPI, then this option is necessary too.
|
||||
Memory hotunplug consists of two phases:
|
||||
|
||||
- ACPI0004,PNP0A05 and PNP0A06 Container Driver (under ACPI Support menu)
|
||||
(``CONFIG_ACPI_CONTAINER``).
|
||||
(1) Offlining memory blocks
|
||||
(2) Removing the memory from Linux
|
||||
|
||||
This option can be kernel module too.
|
||||
In the fist phase, memory is "hidden" from the page allocator again, for
|
||||
example, by migrating busy memory to other memory locations and removing all
|
||||
relevant free pages from the page allocator After this phase, the memory is no
|
||||
longer visible in memory statistics of the system.
|
||||
|
||||
In the second phase, the memory blocks are removed and metadata is freed.
|
||||
|
||||
.. _memory_hotplug_sysfs_files:
|
||||
Memory Hotplug Notifications
|
||||
============================
|
||||
|
||||
sysfs files for memory hotplug
|
||||
There are various ways how Linux is notified about memory hotplug events such
|
||||
that it can start adding hotplugged memory. This description is limited to
|
||||
systems that support ACPI; mechanisms specific to other firmware interfaces or
|
||||
virtual machines are not described.
|
||||
|
||||
ACPI Notifications
|
||||
------------------
|
||||
|
||||
Platforms that support ACPI, such as x86_64, can support memory hotplug
|
||||
notifications via ACPI.
|
||||
|
||||
In general, a firmware supporting memory hotplug defines a memory class object
|
||||
HID "PNP0C80". When notified about hotplug of a new memory device, the ACPI
|
||||
driver will hotplug the memory to Linux.
|
||||
|
||||
If the firmware supports hotplug of NUMA nodes, it defines an object _HID
|
||||
"ACPI0004", "PNP0A05", or "PNP0A06". When notified about an hotplug event, all
|
||||
assigned memory devices are added to Linux by the ACPI driver.
|
||||
|
||||
Similarly, Linux can be notified about requests to hotunplug a memory device or
|
||||
a NUMA node via ACPI. The ACPI driver will try offlining all relevant memory
|
||||
blocks, and, if successful, hotunplug the memory from Linux.
|
||||
|
||||
Manual Probing
|
||||
--------------
|
||||
|
||||
On some architectures, the firmware may not be able to notify the operating
|
||||
system about a memory hotplug event. Instead, the memory has to be manually
|
||||
probed from user space.
|
||||
|
||||
The probe interface is located at::
|
||||
|
||||
/sys/devices/system/memory/probe
|
||||
|
||||
Only complete memory blocks can be probed. Individual memory blocks are probed
|
||||
by providing the physical start address of the memory block::
|
||||
|
||||
% echo addr > /sys/devices/system/memory/probe
|
||||
|
||||
Which results in a memory block for the range [addr, addr + memory_block_size)
|
||||
being created.
|
||||
|
||||
.. note::
|
||||
|
||||
Using the probe interface is discouraged as it is easy to crash the kernel,
|
||||
because Linux cannot validate user input; this interface might be removed in
|
||||
the future.
|
||||
|
||||
Onlining and Offlining Memory Blocks
|
||||
====================================
|
||||
|
||||
After a memory block has been created, Linux has to be instructed to actually
|
||||
make use of that memory: the memory block has to be "online".
|
||||
|
||||
Before a memory block can be removed, Linux has to stop using any memory part of
|
||||
the memory block: the memory block has to be "offlined".
|
||||
|
||||
The Linux kernel can be configured to automatically online added memory blocks
|
||||
and drivers automatically trigger offlining of memory blocks when trying
|
||||
hotunplug of memory. Memory blocks can only be removed once offlining succeeded
|
||||
and drivers may trigger offlining of memory blocks when attempting hotunplug of
|
||||
memory.
|
||||
|
||||
Onlining Memory Blocks Manually
|
||||
-------------------------------
|
||||
|
||||
If auto-onlining of memory blocks isn't enabled, user-space has to manually
|
||||
trigger onlining of memory blocks. Often, udev rules are used to automate this
|
||||
task in user space.
|
||||
|
||||
Onlining of a memory block can be triggered via::
|
||||
|
||||
% echo online > /sys/devices/system/memory/memoryXXX/state
|
||||
|
||||
Or alternatively::
|
||||
|
||||
% echo 1 > /sys/devices/system/memory/memoryXXX/online
|
||||
|
||||
The kernel will select the target zone automatically, usually defaulting to
|
||||
``ZONE_NORMAL`` unless ``movablecore=1`` has been specified on the kernel
|
||||
command line or if the memory block would intersect the ZONE_MOVABLE already.
|
||||
|
||||
One can explicitly request to associate an offline memory block with
|
||||
ZONE_MOVABLE by::
|
||||
|
||||
% echo online_movable > /sys/devices/system/memory/memoryXXX/state
|
||||
|
||||
Or one can explicitly request a kernel zone (usually ZONE_NORMAL) by::
|
||||
|
||||
% echo online_kernel > /sys/devices/system/memory/memoryXXX/state
|
||||
|
||||
In any case, if onlining succeeds, the state of the memory block is changed to
|
||||
be "online". If it fails, the state of the memory block will remain unchanged
|
||||
and the above commands will fail.
|
||||
|
||||
Onlining Memory Blocks Automatically
|
||||
------------------------------------
|
||||
|
||||
The kernel can be configured to try auto-onlining of newly added memory blocks.
|
||||
If this feature is disabled, the memory blocks will stay offline until
|
||||
explicitly onlined from user space.
|
||||
|
||||
The configured auto-online behavior can be observed via::
|
||||
|
||||
% cat /sys/devices/system/memory/auto_online_blocks
|
||||
|
||||
Auto-onlining can be enabled by writing ``online``, ``online_kernel`` or
|
||||
``online_movable`` to that file, like::
|
||||
|
||||
% echo online > /sys/devices/system/memory/auto_online_blocks
|
||||
|
||||
Modifying the auto-online behavior will only affect all subsequently added
|
||||
memory blocks only.
|
||||
|
||||
.. note::
|
||||
|
||||
In corner cases, auto-onlining can fail. The kernel won't retry. Note that
|
||||
auto-onlining is not expected to fail in default configurations.
|
||||
|
||||
.. note::
|
||||
|
||||
DLPAR on ppc64 ignores the ``offline`` setting and will still online added
|
||||
memory blocks; if onlining fails, memory blocks are removed again.
|
||||
|
||||
Offlining Memory Blocks
|
||||
-----------------------
|
||||
|
||||
In the current implementation, Linux's memory offlining will try migrating all
|
||||
movable pages off the affected memory block. As most kernel allocations, such as
|
||||
page tables, are unmovable, page migration can fail and, therefore, inhibit
|
||||
memory offlining from succeeding.
|
||||
|
||||
Having the memory provided by memory block managed by ZONE_MOVABLE significantly
|
||||
increases memory offlining reliability; still, memory offlining can fail in
|
||||
some corner cases.
|
||||
|
||||
Further, memory offlining might retry for a long time (or even forever), until
|
||||
aborted by the user.
|
||||
|
||||
Offlining of a memory block can be triggered via::
|
||||
|
||||
% echo offline > /sys/devices/system/memory/memoryXXX/state
|
||||
|
||||
Or alternatively::
|
||||
|
||||
% echo 0 > /sys/devices/system/memory/memoryXXX/online
|
||||
|
||||
If offlining succeeds, the state of the memory block is changed to be "offline".
|
||||
If it fails, the state of the memory block will remain unchanged and the above
|
||||
commands will fail, for example, via::
|
||||
|
||||
bash: echo: write error: Device or resource busy
|
||||
|
||||
or via::
|
||||
|
||||
bash: echo: write error: Invalid argument
|
||||
|
||||
Observing the State of Memory Blocks
|
||||
------------------------------------
|
||||
|
||||
The state (online/offline/going-offline) of a memory block can be observed
|
||||
either via::
|
||||
|
||||
% cat /sys/device/system/memory/memoryXXX/state
|
||||
|
||||
Or alternatively (1/0) via::
|
||||
|
||||
% cat /sys/device/system/memory/memoryXXX/online
|
||||
|
||||
For an online memory block, the managing zone can be observed via::
|
||||
|
||||
% cat /sys/device/system/memory/memoryXXX/valid_zones
|
||||
|
||||
Configuring Memory Hot(Un)Plug
|
||||
==============================
|
||||
|
||||
All memory blocks have their device information in sysfs. Each memory block
|
||||
is described under ``/sys/devices/system/memory`` as::
|
||||
There are various ways how system administrators can configure memory
|
||||
hot(un)plug and interact with memory blocks, especially, to online them.
|
||||
|
||||
Memory Hot(Un)Plug Configuration via Sysfs
|
||||
------------------------------------------
|
||||
|
||||
Some memory hot(un)plug properties can be configured or inspected via sysfs in::
|
||||
|
||||
/sys/devices/system/memory/
|
||||
|
||||
The following files are currently defined:
|
||||
|
||||
====================== =========================================================
|
||||
``auto_online_blocks`` read-write: set or get the default state of new memory
|
||||
blocks; configure auto-onlining.
|
||||
|
||||
The default value depends on the
|
||||
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel configuration
|
||||
option.
|
||||
|
||||
See the ``state`` property of memory blocks for details.
|
||||
``block_size_bytes`` read-only: the size in bytes of a memory block.
|
||||
``probe`` write-only: add (probe) selected memory blocks manually
|
||||
from user space by supplying the physical start address.
|
||||
|
||||
Availability depends on the CONFIG_ARCH_MEMORY_PROBE
|
||||
kernel configuration option.
|
||||
``uevent`` read-write: generic udev file for device subsystems.
|
||||
====================== =========================================================
|
||||
|
||||
.. note::
|
||||
|
||||
When the CONFIG_MEMORY_FAILURE kernel configuration option is enabled, two
|
||||
additional files ``hard_offline_page`` and ``soft_offline_page`` are available
|
||||
to trigger hwpoisoning of pages, for example, for testing purposes. Note that
|
||||
this functionality is not really related to memory hot(un)plug or actual
|
||||
offlining of memory blocks.
|
||||
|
||||
Memory Block Configuration via Sysfs
|
||||
------------------------------------
|
||||
|
||||
Each memory block is represented as a memory block device that can be
|
||||
onlined or offlined. All memory blocks have their device information located in
|
||||
sysfs. Each present memory block is listed under
|
||||
``/sys/devices/system/memory`` as::
|
||||
|
||||
/sys/devices/system/memory/memoryXXX
|
||||
|
||||
where XXX is the memory block id.
|
||||
where XXX is the memory block id; the number of digits is variable.
|
||||
|
||||
For the memory block covered by the sysfs directory. It is expected that all
|
||||
memory sections in this range are present and no memory holes exist in the
|
||||
range. Currently there is no way to determine if there is a memory hole, but
|
||||
the existence of one should not affect the hotplug capabilities of the memory
|
||||
block.
|
||||
A present memory block indicates that some memory in the range is present;
|
||||
however, a memory block might span memory holes. A memory block spanning memory
|
||||
holes cannot be offlined.
|
||||
|
||||
For example, assume 1GiB memory block size. A device for a memory starting at
|
||||
For example, assume 1 GiB memory block size. A device for a memory starting at
|
||||
0x100000000 is ``/sys/device/system/memory/memory4``::
|
||||
|
||||
(0x100000000 / 1Gib = 4)
|
||||
|
||||
This device covers address range [0x100000000 ... 0x140000000)
|
||||
|
||||
Under each memory block, you can see 5 files:
|
||||
|
||||
- ``/sys/devices/system/memory/memoryXXX/phys_index``
|
||||
- ``/sys/devices/system/memory/memoryXXX/phys_device``
|
||||
- ``/sys/devices/system/memory/memoryXXX/state``
|
||||
- ``/sys/devices/system/memory/memoryXXX/removable``
|
||||
- ``/sys/devices/system/memory/memoryXXX/valid_zones``
|
||||
The following files are currently defined:
|
||||
|
||||
=================== ============================================================
|
||||
``phys_index`` read-only and contains memory block id, same as XXX.
|
||||
``state`` read-write
|
||||
|
||||
- at read: contains online/offline state of memory.
|
||||
- at write: user can specify "online_kernel",
|
||||
|
||||
"online_movable", "online", "offline" command
|
||||
which will be performed on all sections in the block.
|
||||
``online`` read-write: simplified interface to trigger onlining /
|
||||
offlining and to observe the state of a memory block.
|
||||
When onlining, the zone is selected automatically.
|
||||
``phys_device`` read-only: legacy interface only ever used on s390x to
|
||||
expose the covered storage increment.
|
||||
``phys_index`` read-only: the memory block id (XXX).
|
||||
``removable`` read-only: legacy interface that indicated whether a memory
|
||||
block was likely to be offlineable or not. Newer kernel
|
||||
versions return "1" if and only if the kernel supports
|
||||
memory offlining.
|
||||
``valid_zones`` read-only: designed to show by which zone memory provided by
|
||||
a memory block is managed, and to show by which zone memory
|
||||
provided by an offline memory block could be managed when
|
||||
onlining.
|
||||
block was likely to be offlineable or not. Nowadays, the
|
||||
kernel return ``1`` if and only if it supports memory
|
||||
offlining.
|
||||
``state`` read-write: advanced interface to trigger onlining /
|
||||
offlining and to observe the state of a memory block.
|
||||
|
||||
The first column shows it`s default zone.
|
||||
When writing, ``online``, ``offline``, ``online_kernel`` and
|
||||
``online_movable`` are supported.
|
||||
|
||||
"memory6/valid_zones: Normal Movable" shows this memoryblock
|
||||
can be onlined to ZONE_NORMAL by default and to ZONE_MOVABLE
|
||||
by online_movable.
|
||||
``online_movable`` specifies onlining to ZONE_MOVABLE.
|
||||
``online_kernel`` specifies onlining to the default kernel
|
||||
zone for the memory block, such as ZONE_NORMAL.
|
||||
``online`` let's the kernel select the zone automatically.
|
||||
|
||||
"memory7/valid_zones: Movable Normal" shows this memoryblock
|
||||
can be onlined to ZONE_MOVABLE by default and to ZONE_NORMAL
|
||||
by online_kernel.
|
||||
When reading, ``online``, ``offline`` and ``going-offline``
|
||||
may be returned.
|
||||
``uevent`` read-write: generic uevent file for devices.
|
||||
``valid_zones`` read-only: when a block is online, shows the zone it
|
||||
belongs to; when a block is offline, shows what zone will
|
||||
manage it when the block will be onlined.
|
||||
|
||||
For online memory blocks, ``DMA``, ``DMA32``, ``Normal``,
|
||||
``Movable`` and ``none`` may be returned. ``none`` indicates
|
||||
that memory provided by a memory block is managed by
|
||||
multiple zones or spans multiple nodes; such memory blocks
|
||||
cannot be offlined. ``Movable`` indicates ZONE_MOVABLE.
|
||||
Other values indicate a kernel zone.
|
||||
|
||||
For offline memory blocks, the first column shows the
|
||||
zone the kernel would select when onlining the memory block
|
||||
right now without further specifying a zone.
|
||||
|
||||
Availability depends on the CONFIG_MEMORY_HOTREMOVE
|
||||
kernel configuration option.
|
||||
=================== ============================================================
|
||||
|
||||
.. note::
|
||||
|
||||
These directories/files appear after physical memory hotplug phase.
|
||||
If the CONFIG_NUMA kernel configuration option is enabled, the memoryXXX/
|
||||
directories can also be accessed via symbolic links located in the
|
||||
``/sys/devices/system/node/node*`` directories.
|
||||
|
||||
If CONFIG_NUMA is enabled the memoryXXX/ directories can also be accessed
|
||||
via symbolic links located in the ``/sys/devices/system/node/node*`` directories.
|
||||
|
||||
For example::
|
||||
For example::
|
||||
|
||||
/sys/devices/system/node/node0/memory9 -> ../../memory/memory9
|
||||
|
||||
A backlink will also be created::
|
||||
A backlink will also be created::
|
||||
|
||||
/sys/devices/system/memory/memory9/node0 -> ../../node/node0
|
||||
|
||||
.. _memory_hotplug_physical_mem:
|
||||
Command Line Parameters
|
||||
-----------------------
|
||||
|
||||
Physical memory hot-add phase
|
||||
=============================
|
||||
Some command line parameters affect memory hot(un)plug handling. The following
|
||||
command line parameters are relevant:
|
||||
|
||||
Hardware(Firmware) Support
|
||||
--------------------------
|
||||
======================== =======================================================
|
||||
``memhp_default_state`` configure auto-onlining by essentially setting
|
||||
``/sys/devices/system/memory/auto_online_blocks``.
|
||||
``movablecore`` configure automatic zone selection of the kernel. When
|
||||
set, the kernel will default to ZONE_MOVABLE, unless
|
||||
other zones can be kept contiguous.
|
||||
======================== =======================================================
|
||||
|
||||
On x86_64/ia64 platform, memory hotplug by ACPI is supported.
|
||||
Module Parameters
|
||||
------------------
|
||||
|
||||
In general, the firmware (ACPI) which supports memory hotplug defines
|
||||
memory class object of _HID "PNP0C80". When a notify is asserted to PNP0C80,
|
||||
Linux's ACPI handler does hot-add memory to the system and calls a hotplug udev
|
||||
script. This will be done automatically.
|
||||
Instead of additional command line parameters or sysfs files, the
|
||||
``memory_hotplug`` subsystem now provides a dedicated namespace for module
|
||||
parameters. Module parameters can be set via the command line by predicating
|
||||
them with ``memory_hotplug.`` such as::
|
||||
|
||||
But scripts for memory hotplug are not contained in generic udev package(now).
|
||||
You may have to write it by yourself or online/offline memory by hand.
|
||||
Please see :ref:`memory_hotplug_how_to_online_memory` and
|
||||
:ref:`memory_hotplug_how_to_offline_memory`.
|
||||
memory_hotplug.memmap_on_memory=1
|
||||
|
||||
If firmware supports NUMA-node hotplug, and defines an object _HID "ACPI0004",
|
||||
"PNP0A05", or "PNP0A06", notification is asserted to it, and ACPI handler
|
||||
calls hotplug code for all of objects which are defined in it.
|
||||
If memory device is found, memory hotplug code will be called.
|
||||
and they can be observed (and some even modified at runtime) via::
|
||||
|
||||
Notify memory hot-add event by hand
|
||||
-----------------------------------
|
||||
/sys/modules/memory_hotplug/parameters/
|
||||
|
||||
On some architectures, the firmware may not notify the kernel of a memory
|
||||
hotplug event. Therefore, the memory "probe" interface is supported to
|
||||
explicitly notify the kernel. This interface depends on
|
||||
CONFIG_ARCH_MEMORY_PROBE and can be configured on powerpc, sh, and x86
|
||||
if hotplug is supported, although for x86 this should be handled by ACPI
|
||||
notification.
|
||||
The following module parameters are currently defined:
|
||||
|
||||
Probe interface is located at::
|
||||
======================== =======================================================
|
||||
``memmap_on_memory`` read-write: Allocate memory for the memmap from the
|
||||
added memory block itself. Even if enabled, actual
|
||||
support depends on various other system properties and
|
||||
should only be regarded as a hint whether the behavior
|
||||
would be desired.
|
||||
|
||||
/sys/devices/system/memory/probe
|
||||
While allocating the memmap from the memory block
|
||||
itself makes memory hotplug less likely to fail and
|
||||
keeps the memmap on the same NUMA node in any case, it
|
||||
can fragment physical memory in a way that huge pages
|
||||
in bigger granularity cannot be formed on hotplugged
|
||||
memory.
|
||||
======================== =======================================================
|
||||
|
||||
You can tell the physical address of new memory to the kernel by::
|
||||
ZONE_MOVABLE
|
||||
============
|
||||
|
||||
% echo start_address_of_new_memory > /sys/devices/system/memory/probe
|
||||
ZONE_MOVABLE is an important mechanism for more reliable memory offlining.
|
||||
Further, having system RAM managed by ZONE_MOVABLE instead of one of the
|
||||
kernel zones can increase the number of possible transparent huge pages and
|
||||
dynamically allocated huge pages.
|
||||
|
||||
Then, [start_address_of_new_memory, start_address_of_new_memory +
|
||||
memory_block_size] memory range is hot-added. In this case, hotplug script is
|
||||
not called (in current implementation). You'll have to online memory by
|
||||
yourself. Please see :ref:`memory_hotplug_how_to_online_memory`.
|
||||
Most kernel allocations are unmovable. Important examples include the memory
|
||||
map (usually 1/64ths of memory), page tables, and kmalloc(). Such allocations
|
||||
can only be served from the kernel zones.
|
||||
|
||||
Logical Memory hot-add phase
|
||||
============================
|
||||
Most user space pages, such as anonymous memory, and page cache pages are
|
||||
movable. Such allocations can be served from ZONE_MOVABLE and the kernel zones.
|
||||
|
||||
State of memory
|
||||
Only movable allocations are served from ZONE_MOVABLE, resulting in unmovable
|
||||
allocations being limited to the kernel zones. Without ZONE_MOVABLE, there is
|
||||
absolutely no guarantee whether a memory block can be offlined successfully.
|
||||
|
||||
Zone Imbalances
|
||||
---------------
|
||||
|
||||
To see (online/offline) state of a memory block, read 'state' file::
|
||||
Having too much system RAM managed by ZONE_MOVABLE is called a zone imbalance,
|
||||
which can harm the system or degrade performance. As one example, the kernel
|
||||
might crash because it runs out of free memory for unmovable allocations,
|
||||
although there is still plenty of free memory left in ZONE_MOVABLE.
|
||||
|
||||
% cat /sys/device/system/memory/memoryXXX/state
|
||||
Usually, MOVABLE:KERNEL ratios of up to 3:1 or even 4:1 are fine. Ratios of 63:1
|
||||
are definitely impossible due to the overhead for the memory map.
|
||||
|
||||
|
||||
- If the memory block is online, you'll read "online".
|
||||
- If the memory block is offline, you'll read "offline".
|
||||
|
||||
|
||||
.. _memory_hotplug_how_to_online_memory:
|
||||
|
||||
How to online memory
|
||||
--------------------
|
||||
|
||||
When the memory is hot-added, the kernel decides whether or not to "online"
|
||||
it according to the policy which can be read from "auto_online_blocks" file::
|
||||
|
||||
% cat /sys/devices/system/memory/auto_online_blocks
|
||||
|
||||
The default depends on the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config
|
||||
option. If it is disabled the default is "offline" which means the newly added
|
||||
memory is not in a ready-to-use state and you have to "online" the newly added
|
||||
memory blocks manually. Automatic onlining can be requested by writing "online"
|
||||
to "auto_online_blocks" file::
|
||||
|
||||
% echo online > /sys/devices/system/memory/auto_online_blocks
|
||||
|
||||
This sets a global policy and impacts all memory blocks that will subsequently
|
||||
be hotplugged. Currently offline blocks keep their state. It is possible, under
|
||||
certain circumstances, that some memory blocks will be added but will fail to
|
||||
online. User space tools can check their "state" files
|
||||
(``/sys/devices/system/memory/memoryXXX/state``) and try to online them manually.
|
||||
|
||||
If the automatic onlining wasn't requested, failed, or some memory block was
|
||||
offlined it is possible to change the individual block's state by writing to the
|
||||
"state" file::
|
||||
|
||||
% echo online > /sys/devices/system/memory/memoryXXX/state
|
||||
|
||||
This onlining will not change the ZONE type of the target memory block,
|
||||
If the memory block doesn't belong to any zone an appropriate kernel zone
|
||||
(usually ZONE_NORMAL) will be used unless movable_node kernel command line
|
||||
option is specified when ZONE_MOVABLE will be used.
|
||||
|
||||
You can explicitly request to associate it with ZONE_MOVABLE by::
|
||||
|
||||
% echo online_movable > /sys/devices/system/memory/memoryXXX/state
|
||||
|
||||
.. note:: current limit: this memory block must be adjacent to ZONE_MOVABLE
|
||||
|
||||
Or you can explicitly request a kernel zone (usually ZONE_NORMAL) by::
|
||||
|
||||
% echo online_kernel > /sys/devices/system/memory/memoryXXX/state
|
||||
|
||||
.. note:: current limit: this memory block must be adjacent to ZONE_NORMAL
|
||||
|
||||
An explicit zone onlining can fail (e.g. when the range is already within
|
||||
and existing and incompatible zone already).
|
||||
|
||||
After this, memory block XXX's state will be 'online' and the amount of
|
||||
available memory will be increased.
|
||||
|
||||
This may be changed in future.
|
||||
|
||||
Logical memory remove
|
||||
=====================
|
||||
|
||||
Memory offline and ZONE_MOVABLE
|
||||
-------------------------------
|
||||
|
||||
Memory offlining is more complicated than memory online. Because memory offline
|
||||
has to make the whole memory block be unused, memory offline can fail if
|
||||
the memory block includes memory which cannot be freed.
|
||||
|
||||
In general, memory offline can use 2 techniques.
|
||||
|
||||
(1) reclaim and free all memory in the memory block.
|
||||
(2) migrate all pages in the memory block.
|
||||
|
||||
In the current implementation, Linux's memory offline uses method (2), freeing
|
||||
all pages in the memory block by page migration. But not all pages are
|
||||
migratable. Under current Linux, migratable pages are anonymous pages and
|
||||
page caches. For offlining a memory block by migration, the kernel has to
|
||||
guarantee that the memory block contains only migratable pages.
|
||||
|
||||
Now, a boot option for making a memory block which consists of migratable pages
|
||||
is supported. By specifying "kernelcore=" or "movablecore=" boot option, you can
|
||||
create ZONE_MOVABLE...a zone which is just used for movable pages.
|
||||
(See also Documentation/admin-guide/kernel-parameters.rst)
|
||||
|
||||
Assume the system has "TOTAL" amount of memory at boot time, this boot option
|
||||
creates ZONE_MOVABLE as following.
|
||||
|
||||
1) When kernelcore=YYYY boot option is used,
|
||||
Size of memory not for movable pages (not for offline) is YYYY.
|
||||
Size of memory for movable pages (for offline) is TOTAL-YYYY.
|
||||
|
||||
2) When movablecore=ZZZZ boot option is used,
|
||||
Size of memory not for movable pages (not for offline) is TOTAL - ZZZZ.
|
||||
Size of memory for movable pages (for offline) is ZZZZ.
|
||||
Actual safe zone ratios depend on the workload. Extreme cases, like excessive
|
||||
long-term pinning of pages, might not be able to deal with ZONE_MOVABLE at all.
|
||||
|
||||
.. note::
|
||||
|
||||
Unfortunately, there is no information to show which memory block belongs
|
||||
to ZONE_MOVABLE. This is TBD.
|
||||
CMA memory part of a kernel zone essentially behaves like memory in
|
||||
ZONE_MOVABLE and similar considerations apply, especially when combining
|
||||
CMA with ZONE_MOVABLE.
|
||||
|
||||
Memory offlining can fail when dissolving a free huge page on ZONE_MOVABLE
|
||||
and the feature of freeing unused vmemmap pages associated with each hugetlb
|
||||
page is enabled.
|
||||
ZONE_MOVABLE Sizing Considerations
|
||||
----------------------------------
|
||||
|
||||
This can happen when we have plenty of ZONE_MOVABLE memory, but not enough
|
||||
kernel memory to allocate vmemmmap pages. We may even be able to migrate
|
||||
huge page contents, but will not be able to dissolve the source huge page.
|
||||
This will prevent an offline operation and is unfortunate as memory offlining
|
||||
is expected to succeed on movable zones. Users that depend on memory hotplug
|
||||
to succeed for movable zones should carefully consider whether the memory
|
||||
savings gained from this feature are worth the risk of possibly not being
|
||||
able to offline memory in certain situations.
|
||||
We usually expect that a large portion of available system RAM will actually
|
||||
be consumed by user space, either directly or indirectly via the page cache. In
|
||||
the normal case, ZONE_MOVABLE can be used when allocating such pages just fine.
|
||||
|
||||
.. note::
|
||||
Techniques that rely on long-term pinnings of memory (especially, RDMA and
|
||||
vfio) are fundamentally problematic with ZONE_MOVABLE and, therefore, memory
|
||||
hot remove. Pinned pages cannot reside on ZONE_MOVABLE, to guarantee that
|
||||
memory can still get hot removed - be aware that pinning can fail even if
|
||||
there is plenty of free memory in ZONE_MOVABLE. In addition, using
|
||||
ZONE_MOVABLE might make page pinning more expensive, because pages have to be
|
||||
migrated off that zone first.
|
||||
With that in mind, it makes sense that we can have a big portion of system RAM
|
||||
managed by ZONE_MOVABLE. However, there are some things to consider when using
|
||||
ZONE_MOVABLE, especially when fine-tuning zone ratios:
|
||||
|
||||
.. _memory_hotplug_how_to_offline_memory:
|
||||
- Having a lot of offline memory blocks. Even offline memory blocks consume
|
||||
memory for metadata and page tables in the direct map; having a lot of offline
|
||||
memory blocks is not a typical case, though.
|
||||
|
||||
How to offline memory
|
||||
---------------------
|
||||
- Memory ballooning without balloon compaction is incompatible with
|
||||
ZONE_MOVABLE. Only some implementations, such as virtio-balloon and
|
||||
pseries CMM, fully support balloon compaction.
|
||||
|
||||
You can offline a memory block by using the same sysfs interface that was used
|
||||
in memory onlining::
|
||||
Further, the CONFIG_BALLOON_COMPACTION kernel configuration option might be
|
||||
disabled. In that case, balloon inflation will only perform unmovable
|
||||
allocations and silently create a zone imbalance, usually triggered by
|
||||
inflation requests from the hypervisor.
|
||||
|
||||
% echo offline > /sys/devices/system/memory/memoryXXX/state
|
||||
- Gigantic pages are unmovable, resulting in user space consuming a
|
||||
lot of unmovable memory.
|
||||
|
||||
If offline succeeds, the state of the memory block is changed to be "offline".
|
||||
If it fails, some error core (like -EBUSY) will be returned by the kernel.
|
||||
Even if a memory block does not belong to ZONE_MOVABLE, you can try to offline
|
||||
it. If it doesn't contain 'unmovable' memory, you'll get success.
|
||||
- Huge pages are unmovable when an architectures does not support huge
|
||||
page migration, resulting in a similar issue as with gigantic pages.
|
||||
|
||||
A memory block under ZONE_MOVABLE is considered to be able to be offlined
|
||||
easily. But under some busy state, it may return -EBUSY. Even if a memory
|
||||
block cannot be offlined due to -EBUSY, you can retry offlining it and may be
|
||||
able to offline it (or not). (For example, a page is referred to by some kernel
|
||||
internal call and released soon.)
|
||||
- Page tables are unmovable. Excessive swapping, mapping extremely large
|
||||
files or ZONE_DEVICE memory can be problematic, although only really relevant
|
||||
in corner cases. When we manage a lot of user space memory that has been
|
||||
swapped out or is served from a file/persistent memory/... we still need a lot
|
||||
of page tables to manage that memory once user space accessed that memory.
|
||||
|
||||
Consideration:
|
||||
Memory hotplug's design direction is to make the possibility of memory
|
||||
offlining higher and to guarantee unplugging memory under any situation. But
|
||||
it needs more work. Returning -EBUSY under some situation may be good because
|
||||
the user can decide to retry more or not by himself. Currently, memory
|
||||
offlining code does some amount of retry with 120 seconds timeout.
|
||||
- In certain DAX configurations the memory map for the device memory will be
|
||||
allocated from the kernel zones.
|
||||
|
||||
Physical memory remove
|
||||
======================
|
||||
- KASAN can have a significant memory overhead, for example, consuming 1/8th of
|
||||
the total system memory size as (unmovable) tracking metadata.
|
||||
|
||||
Need more implementation yet....
|
||||
- Notification completion of remove works by OS to firmware.
|
||||
- Guard from remove if not yet.
|
||||
- Long-term pinning of pages. Techniques that rely on long-term pinnings
|
||||
(especially, RDMA and vfio/mdev) are fundamentally problematic with
|
||||
ZONE_MOVABLE, and therefore, memory offlining. Pinned pages cannot reside
|
||||
on ZONE_MOVABLE as that would turn these pages unmovable. Therefore, they
|
||||
have to be migrated off that zone while pinning. Pinning a page can fail
|
||||
even if there is plenty of free memory in ZONE_MOVABLE.
|
||||
|
||||
In addition, using ZONE_MOVABLE might make page pinning more expensive,
|
||||
because of the page migration overhead.
|
||||
|
||||
Locking Internals
|
||||
=================
|
||||
By default, all the memory configured at boot time is managed by the kernel
|
||||
zones and ZONE_MOVABLE is not used.
|
||||
|
||||
When adding/removing memory that uses memory block devices (i.e. ordinary RAM),
|
||||
the device_hotplug_lock should be held to:
|
||||
To enable ZONE_MOVABLE to include the memory present at boot and to control the
|
||||
ratio between movable and kernel zones there are two command line options:
|
||||
``kernelcore=`` and ``movablecore=``. See
|
||||
Documentation/admin-guide/kernel-parameters.rst for their description.
|
||||
|
||||
- synchronize against online/offline requests (e.g. via sysfs). This way, memory
|
||||
block devices can only be accessed (.online/.state attributes) by user
|
||||
space once memory has been fully added. And when removing memory, we
|
||||
know nobody is in critical sections.
|
||||
- synchronize against CPU hotplug and similar (e.g. relevant for ACPI and PPC)
|
||||
Memory Offlining and ZONE_MOVABLE
|
||||
---------------------------------
|
||||
|
||||
Especially, there is a possible lock inversion that is avoided using
|
||||
device_hotplug_lock when adding memory and user space tries to online that
|
||||
memory faster than expected:
|
||||
Even with ZONE_MOVABLE, there are some corner cases where offlining a memory
|
||||
block might fail:
|
||||
|
||||
- device_online() will first take the device_lock(), followed by
|
||||
mem_hotplug_lock
|
||||
- add_memory_resource() will first take the mem_hotplug_lock, followed by
|
||||
the device_lock() (while creating the devices, during bus_add_device()).
|
||||
- Memory blocks with memory holes; this applies to memory blocks present during
|
||||
boot and can apply to memory blocks hotplugged via the XEN balloon and the
|
||||
Hyper-V balloon.
|
||||
|
||||
As the device is visible to user space before taking the device_lock(), this
|
||||
can result in a lock inversion.
|
||||
- Mixed NUMA nodes and mixed zones within a single memory block prevent memory
|
||||
offlining; this applies to memory blocks present during boot only.
|
||||
|
||||
onlining/offlining of memory should be done via device_online()/
|
||||
device_offline() - to make sure it is properly synchronized to actions
|
||||
via sysfs. Holding device_hotplug_lock is advised (to e.g. protect online_type)
|
||||
- Special memory blocks prevented by the system from getting offlined. Examples
|
||||
include any memory available during boot on arm64 or memory blocks spanning
|
||||
the crashkernel area on s390x; this usually applies to memory blocks present
|
||||
during boot only.
|
||||
|
||||
When adding/removing/onlining/offlining memory or adding/removing
|
||||
heterogeneous/device memory, we should always hold the mem_hotplug_lock in
|
||||
write mode to serialise memory hotplug (e.g. access to global/zone
|
||||
variables).
|
||||
- Memory blocks overlapping with CMA areas cannot be offlined, this applies to
|
||||
memory blocks present during boot only.
|
||||
|
||||
In addition, mem_hotplug_lock (in contrast to device_hotplug_lock) in read
|
||||
mode allows for a quite efficient get_online_mems/put_online_mems
|
||||
implementation, so code accessing memory can protect from that memory
|
||||
vanishing.
|
||||
- Concurrent activity that operates on the same physical memory area, such as
|
||||
allocating gigantic pages, can result in temporary offlining failures.
|
||||
|
||||
- Out of memory when dissolving huge pages, especially when freeing unused
|
||||
vmemmap pages associated with each hugetlb page is enabled.
|
||||
|
||||
Future Work
|
||||
===========
|
||||
Offlining code may be able to migrate huge page contents, but may not be able
|
||||
to dissolve the source huge page because it fails allocating (unmovable) pages
|
||||
for the vmemmap, because the system might not have free memory in the kernel
|
||||
zones left.
|
||||
|
||||
- allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like
|
||||
sysctl or new control file.
|
||||
- showing memory block and physical device relationship.
|
||||
- test and make it better memory offlining.
|
||||
- support HugeTLB page migration and offlining.
|
||||
- memmap removing at memory offline.
|
||||
- physical remove memory.
|
||||
Users that depend on memory offlining to succeed for movable zones should
|
||||
carefully consider whether the memory savings gained from this feature are
|
||||
worth the risk of possibly not being able to offline memory in certain
|
||||
situations.
|
||||
|
||||
Further, when running into out of memory situations while migrating pages, or
|
||||
when still encountering permanently unmovable pages within ZONE_MOVABLE
|
||||
(-> BUG), memory offlining will keep retrying until it eventually succeeds.
|
||||
|
||||
When offlining is triggered from user space, the offlining context can be
|
||||
terminated by sending a fatal signal. A timeout based offlining can easily be
|
||||
implemented via::
|
||||
|
||||
% timeout $TIMEOUT offline_block | failure_handling
|
||||
|
@ -245,6 +245,13 @@ MPOL_INTERLEAVED
|
||||
address range or file. During system boot up, the temporary
|
||||
interleaved system default policy works in this mode.
|
||||
|
||||
MPOL_PREFERRED_MANY
|
||||
This mode specifices that the allocation should be preferrably
|
||||
satisfied from the nodemask specified in the policy. If there is
|
||||
a memory pressure on all nodes in the nodemask, the allocation
|
||||
can fall back to all existing numa nodes. This is effectively
|
||||
MPOL_PREFERRED allowed for a mask rather than a single node.
|
||||
|
||||
NUMA memory policy supports the following optional mode flags:
|
||||
|
||||
MPOL_F_STATIC_NODES
|
||||
@ -253,10 +260,10 @@ MPOL_F_STATIC_NODES
|
||||
nodes changes after the memory policy has been defined.
|
||||
|
||||
Without this flag, any time a mempolicy is rebound because of a
|
||||
change in the set of allowed nodes, the node (Preferred) or
|
||||
nodemask (Bind, Interleave) is remapped to the new set of
|
||||
allowed nodes. This may result in nodes being used that were
|
||||
previously undesired.
|
||||
change in the set of allowed nodes, the preferred nodemask (Preferred
|
||||
Many), preferred node (Preferred) or nodemask (Bind, Interleave) is
|
||||
remapped to the new set of allowed nodes. This may result in nodes
|
||||
being used that were previously undesired.
|
||||
|
||||
With this flag, if the user-specified nodes overlap with the
|
||||
nodes allowed by the task's cpuset, then the memory policy is
|
||||
|
@ -118,7 +118,8 @@ compaction_proactiveness
|
||||
|
||||
This tunable takes a value in the range [0, 100] with a default value of
|
||||
20. This tunable determines how aggressively compaction is done in the
|
||||
background. Setting it to 0 disables proactive compaction.
|
||||
background. Write of a non zero value to this tunable will immediately
|
||||
trigger the proactive compaction. Setting it to 0 disables proactive compaction.
|
||||
|
||||
Note that compaction has a non-trivial system-wide impact as pages
|
||||
belonging to different processes are moved around, which could also lead
|
||||
|
@ -72,7 +72,7 @@ On PowerPC
|
||||
|
||||
On other
|
||||
If you know of the key combos for other architectures, please
|
||||
let me know so I can add them to this section.
|
||||
submit a patch to be included in this section.
|
||||
|
||||
On all
|
||||
Write a character to /proc/sysrq-trigger. e.g.::
|
||||
@ -205,10 +205,12 @@ frozen (probably root) filesystem via the FIFREEZE ioctl.
|
||||
Sometimes SysRq seems to get 'stuck' after using it, what can I do?
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
That happens to me, also. I've found that tapping shift, alt, and control
|
||||
on both sides of the keyboard, and hitting an invalid sysrq sequence again
|
||||
will fix the problem. (i.e., something like :kbd:`alt-sysrq-z`). Switching to
|
||||
another virtual console (:kbd:`ALT+Fn`) and then back again should also help.
|
||||
When this happens, try tapping shift, alt and control on both sides of the
|
||||
keyboard, and hitting an invalid sysrq sequence again. (i.e., something like
|
||||
:kbd:`alt-sysrq-z`).
|
||||
|
||||
Switching to another virtual console (:kbd:`ALT+Fn`) and then back again
|
||||
should also help.
|
||||
|
||||
I hit SysRq, but nothing seems to happen, what's wrong?
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
@ -58,11 +58,19 @@ Kirkwood family
|
||||
- Product Brief : https://web.archive.org/web/20120616201621/http://www.marvell.com/embedded-processors/kirkwood/assets/88F6180-003_ver1.pdf
|
||||
- Hardware Spec : https://web.archive.org/web/20130730091654/http://www.marvell.com/embedded-processors/kirkwood/assets/HW_88F6180_OpenSource.pdf
|
||||
- Functional Spec: https://web.archive.org/web/20130730091033/http://www.marvell.com/embedded-processors/kirkwood/assets/FS_88F6180_9x_6281_OpenSource.pdf
|
||||
- 88F6280
|
||||
|
||||
- Product Brief : https://web.archive.org/web/20130730091058/http://www.marvell.com/embedded-processors/kirkwood/assets/88F6280_SoC_PB-001.pdf
|
||||
- 88F6281
|
||||
|
||||
- Product Brief : https://web.archive.org/web/20120131133709/http://www.marvell.com/embedded-processors/kirkwood/assets/88F6281-004_ver1.pdf
|
||||
- Hardware Spec : https://web.archive.org/web/20120620073511/http://www.marvell.com/embedded-processors/kirkwood/assets/HW_88F6281_OpenSource.pdf
|
||||
- Functional Spec: https://web.archive.org/web/20130730091033/http://www.marvell.com/embedded-processors/kirkwood/assets/FS_88F6180_9x_6281_OpenSource.pdf
|
||||
- 88F6321
|
||||
- 88F6322
|
||||
- 88F6323
|
||||
|
||||
- Product Brief : https://web.archive.org/web/20120616201639/http://www.marvell.com/embedded-processors/kirkwood/assets/88f632x_pb.pdf
|
||||
Homepage:
|
||||
https://web.archive.org/web/20160513194943/http://www.marvell.com/embedded-processors/kirkwood/
|
||||
Core:
|
||||
@ -89,6 +97,10 @@ Discovery family
|
||||
|
||||
- MV76100
|
||||
|
||||
- Product Brief : https://web.archive.org/web/20140722064429/http://www.marvell.com/embedded-processors/discovery-innovation/assets/MV76100-002_WEB.pdf
|
||||
- Hardware Spec : https://web.archive.org/web/20140722064425/http://www.marvell.com/embedded-processors/discovery-innovation/assets/HW_MV76100_OpenSource.pdf
|
||||
- Functional Spec: https://web.archive.org/web/20111110081125/http://www.marvell.com/embedded-processors/discovery-innovation/assets/FS_MV76100_78100_78200_OpenSource.pdf
|
||||
|
||||
Not supported by the Linux kernel.
|
||||
|
||||
Core:
|
||||
@ -124,17 +136,24 @@ EBU Armada family
|
||||
|
||||
Armada 38x Flavors:
|
||||
- 88F6810 Armada 380
|
||||
- 88F6811 Armada 381
|
||||
- 88F6821 Armada 382
|
||||
- 88F6W21 Armada 383
|
||||
- 88F6820 Armada 385
|
||||
- 88F6825
|
||||
- 88F6828 Armada 388
|
||||
|
||||
- Product infos: https://web.archive.org/web/20181006144616/http://www.marvell.com/embedded-processors/armada-38x/
|
||||
- Functional Spec: https://web.archive.org/web/20200420191927/https://www.marvell.com/content/dam/marvell/en/public-collateral/embedded-processors/marvell-embedded-processors-armada-38x-functional-specifications-2015-11.pdf
|
||||
- Hardware Spec: https://web.archive.org/web/20180713105318/https://www.marvell.com/docs/embedded-processors/assets/marvell-embedded-processors-armada-38x-hardware-specifications-2017-03.pdf
|
||||
- Design guide: https://web.archive.org/web/20180712231737/https://www.marvell.com/docs/embedded-processors/assets/marvell-embedded-processors-armada-38x-hardware-design-guide-2017-08.pdf
|
||||
|
||||
Core:
|
||||
ARM Cortex-A9
|
||||
|
||||
Armada 39x Flavors:
|
||||
- 88F6920 Armada 390
|
||||
- 88F6925 Armada 395
|
||||
- 88F6928 Armada 398
|
||||
|
||||
- Product infos: https://web.archive.org/web/20181020222559/http://www.marvell.com/embedded-processors/armada-39x/
|
||||
|
155
Documentation/arm64/asymmetric-32bit.rst
Normal file
155
Documentation/arm64/asymmetric-32bit.rst
Normal file
@ -0,0 +1,155 @@
|
||||
======================
|
||||
Asymmetric 32-bit SoCs
|
||||
======================
|
||||
|
||||
Author: Will Deacon <will@kernel.org>
|
||||
|
||||
This document describes the impact of asymmetric 32-bit SoCs on the
|
||||
execution of 32-bit (``AArch32``) applications.
|
||||
|
||||
Date: 2021-05-17
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
Some Armv9 SoCs suffer from a big.LITTLE misfeature where only a subset
|
||||
of the CPUs are capable of executing 32-bit user applications. On such
|
||||
a system, Linux by default treats the asymmetry as a "mismatch" and
|
||||
disables support for both the ``PER_LINUX32`` personality and
|
||||
``execve(2)`` of 32-bit ELF binaries, with the latter returning
|
||||
``-ENOEXEC``. If the mismatch is detected during late onlining of a
|
||||
64-bit-only CPU, then the onlining operation fails and the new CPU is
|
||||
unavailable for scheduling.
|
||||
|
||||
Surprisingly, these SoCs have been produced with the intention of
|
||||
running legacy 32-bit binaries. Unsurprisingly, that doesn't work very
|
||||
well with the default behaviour of Linux.
|
||||
|
||||
It seems inevitable that future SoCs will drop 32-bit support
|
||||
altogether, so if you're stuck in the unenviable position of needing to
|
||||
run 32-bit code on one of these transitionary platforms then you would
|
||||
be wise to consider alternatives such as recompilation, emulation or
|
||||
retirement. If neither of those options are practical, then read on.
|
||||
|
||||
Enabling kernel support
|
||||
=======================
|
||||
|
||||
Since the kernel support is not completely transparent to userspace,
|
||||
allowing 32-bit tasks to run on an asymmetric 32-bit system requires an
|
||||
explicit "opt-in" and can be enabled by passing the
|
||||
``allow_mismatched_32bit_el0`` parameter on the kernel command-line.
|
||||
|
||||
For the remainder of this document we will refer to an *asymmetric
|
||||
system* to mean an asymmetric 32-bit SoC running Linux with this kernel
|
||||
command-line option enabled.
|
||||
|
||||
Userspace impact
|
||||
================
|
||||
|
||||
32-bit tasks running on an asymmetric system behave in mostly the same
|
||||
way as on a homogeneous system, with a few key differences relating to
|
||||
CPU affinity.
|
||||
|
||||
sysfs
|
||||
-----
|
||||
|
||||
The subset of CPUs capable of running 32-bit tasks is described in
|
||||
``/sys/devices/system/cpu/aarch32_el0`` and is documented further in
|
||||
``Documentation/ABI/testing/sysfs-devices-system-cpu``.
|
||||
|
||||
**Note:** CPUs are advertised by this file as they are detected and so
|
||||
late-onlining of 32-bit-capable CPUs can result in the file contents
|
||||
being modified by the kernel at runtime. Once advertised, CPUs are never
|
||||
removed from the file.
|
||||
|
||||
``execve(2)``
|
||||
-------------
|
||||
|
||||
On a homogeneous system, the CPU affinity of a task is preserved across
|
||||
``execve(2)``. This is not always possible on an asymmetric system,
|
||||
specifically when the new program being executed is 32-bit yet the
|
||||
affinity mask contains 64-bit-only CPUs. In this situation, the kernel
|
||||
determines the new affinity mask as follows:
|
||||
|
||||
1. If the 32-bit-capable subset of the affinity mask is not empty,
|
||||
then the affinity is restricted to that subset and the old affinity
|
||||
mask is saved. This saved mask is inherited over ``fork(2)`` and
|
||||
preserved across ``execve(2)`` of 32-bit programs.
|
||||
|
||||
**Note:** This step does not apply to ``SCHED_DEADLINE`` tasks.
|
||||
See `SCHED_DEADLINE`_.
|
||||
|
||||
2. Otherwise, the cpuset hierarchy of the task is walked until an
|
||||
ancestor is found containing at least one 32-bit-capable CPU. The
|
||||
affinity of the task is then changed to match the 32-bit-capable
|
||||
subset of the cpuset determined by the walk.
|
||||
|
||||
3. On failure (i.e. out of memory), the affinity is changed to the set
|
||||
of all 32-bit-capable CPUs of which the kernel is aware.
|
||||
|
||||
A subsequent ``execve(2)`` of a 64-bit program by the 32-bit task will
|
||||
invalidate the affinity mask saved in (1) and attempt to restore the CPU
|
||||
affinity of the task using the saved mask if it was previously valid.
|
||||
This restoration may fail due to intervening changes to the deadline
|
||||
policy or cpuset hierarchy, in which case the ``execve(2)`` continues
|
||||
with the affinity unchanged.
|
||||
|
||||
Calls to ``sched_setaffinity(2)`` for a 32-bit task will consider only
|
||||
the 32-bit-capable CPUs of the requested affinity mask. On success, the
|
||||
affinity for the task is updated and any saved mask from a prior
|
||||
``execve(2)`` is invalidated.
|
||||
|
||||
``SCHED_DEADLINE``
|
||||
------------------
|
||||
|
||||
Explicit admission of a 32-bit deadline task to the default root domain
|
||||
(e.g. by calling ``sched_setattr(2)``) is rejected on an asymmetric
|
||||
32-bit system unless admission control is disabled by writing -1 to
|
||||
``/proc/sys/kernel/sched_rt_runtime_us``.
|
||||
|
||||
``execve(2)`` of a 32-bit program from a 64-bit deadline task will
|
||||
return ``-ENOEXEC`` if the root domain for the task contains any
|
||||
64-bit-only CPUs and admission control is enabled. Concurrent offlining
|
||||
of 32-bit-capable CPUs may still necessitate the procedure described in
|
||||
`execve(2)`_, in which case step (1) is skipped and a warning is
|
||||
emitted on the console.
|
||||
|
||||
**Note:** It is recommended that a set of 32-bit-capable CPUs are placed
|
||||
into a separate root domain if ``SCHED_DEADLINE`` is to be used with
|
||||
32-bit tasks on an asymmetric system. Failure to do so is likely to
|
||||
result in missed deadlines.
|
||||
|
||||
Cpusets
|
||||
-------
|
||||
|
||||
The affinity of a 32-bit task on an asymmetric system may include CPUs
|
||||
that are not explicitly allowed by the cpuset to which it is attached.
|
||||
This can occur as a result of the following two situations:
|
||||
|
||||
- A 64-bit task attached to a cpuset which allows only 64-bit CPUs
|
||||
executes a 32-bit program.
|
||||
|
||||
- All of the 32-bit-capable CPUs allowed by a cpuset containing a
|
||||
32-bit task are offlined.
|
||||
|
||||
In both of these cases, the new affinity is calculated according to step
|
||||
(2) of the process described in `execve(2)`_ and the cpuset hierarchy is
|
||||
unchanged irrespective of the cgroup version.
|
||||
|
||||
CPU hotplug
|
||||
-----------
|
||||
|
||||
On an asymmetric system, the first detected 32-bit-capable CPU is
|
||||
prevented from being offlined by userspace and any such attempt will
|
||||
return ``-EPERM``. Note that suspend is still permitted even if the
|
||||
primary CPU (i.e. CPU 0) is 64-bit-only.
|
||||
|
||||
KVM
|
||||
---
|
||||
|
||||
Although KVM will not advertise 32-bit EL0 support to any vCPUs on an
|
||||
asymmetric system, a broken guest at EL1 could still attempt to execute
|
||||
32-bit code at EL0. In this case, an exit from a vCPU thread in 32-bit
|
||||
mode will return to host userspace with an ``exit_reason`` of
|
||||
``KVM_EXIT_FAIL_ENTRY`` and will remain non-runnable until successfully
|
||||
re-initialised by a subsequent ``KVM_ARM_VCPU_INIT`` operation.
|
@ -207,10 +207,17 @@ Before jumping into the kernel, the following conditions must be met:
|
||||
software at a higher exception level to prevent execution in an UNKNOWN
|
||||
state.
|
||||
|
||||
- SCR_EL3.FIQ must have the same value across all CPUs the kernel is
|
||||
executing on.
|
||||
- The value of SCR_EL3.FIQ must be the same as the one present at boot
|
||||
time whenever the kernel is executing.
|
||||
For all systems:
|
||||
- If EL3 is present:
|
||||
|
||||
- SCR_EL3.FIQ must have the same value across all CPUs the kernel is
|
||||
executing on.
|
||||
- The value of SCR_EL3.FIQ must be the same as the one present at boot
|
||||
time whenever the kernel is executing.
|
||||
|
||||
- If EL3 is present and the kernel is entered at EL2:
|
||||
|
||||
- SCR_EL3.HCE (bit 8) must be initialised to 0b1.
|
||||
|
||||
For systems with a GICv3 interrupt controller to be used in v3 mode:
|
||||
- If EL3 is present:
|
||||
@ -311,6 +318,28 @@ Before jumping into the kernel, the following conditions must be met:
|
||||
- ZCR_EL2.LEN must be initialised to the same value for all CPUs the
|
||||
kernel will execute on.
|
||||
|
||||
For CPUs with the Scalable Matrix Extension (FEAT_SME):
|
||||
|
||||
- If EL3 is present:
|
||||
|
||||
- CPTR_EL3.ESM (bit 12) must be initialised to 0b1.
|
||||
|
||||
- SCR_EL3.EnTP2 (bit 41) must be initialised to 0b1.
|
||||
|
||||
- SMCR_EL3.LEN must be initialised to the same value for all CPUs the
|
||||
kernel will execute on.
|
||||
|
||||
- If the kernel is entered at EL1 and EL2 is present:
|
||||
|
||||
- CPTR_EL2.TSM (bit 12) must be initialised to 0b0.
|
||||
|
||||
- CPTR_EL2.SMEN (bits 25:24) must be initialised to 0b11.
|
||||
|
||||
- SCTLR_EL2.EnTP2 (bit 60) must be initialised to 0b1.
|
||||
|
||||
- SMCR_EL2.LEN must be initialised to the same value for all CPUs the
|
||||
kernel will execute on.
|
||||
|
||||
The requirements described above for CPU mode, caches, MMUs, architected
|
||||
timers, coherency and system registers apply to all CPUs. All CPUs must
|
||||
enter the kernel in the same exception level. Where the values documented
|
||||
|
@ -10,6 +10,7 @@ ARM64 Architecture
|
||||
acpi_object_usage
|
||||
amu
|
||||
arm-acpi
|
||||
asymmetric-32bit
|
||||
booting
|
||||
cpu-feature-registers
|
||||
elf_hwcaps
|
||||
|
@ -77,14 +77,20 @@ configurable behaviours:
|
||||
address is unknown).
|
||||
|
||||
The user can select the above modes, per thread, using the
|
||||
``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where
|
||||
``flags`` contain one of the following values in the ``PR_MTE_TCF_MASK``
|
||||
``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where ``flags``
|
||||
contains any number of the following values in the ``PR_MTE_TCF_MASK``
|
||||
bit-field:
|
||||
|
||||
- ``PR_MTE_TCF_NONE`` - *Ignore* tag check faults
|
||||
- ``PR_MTE_TCF_NONE`` - *Ignore* tag check faults
|
||||
(ignored if combined with other options)
|
||||
- ``PR_MTE_TCF_SYNC`` - *Synchronous* tag check fault mode
|
||||
- ``PR_MTE_TCF_ASYNC`` - *Asynchronous* tag check fault mode
|
||||
|
||||
If no modes are specified, tag check faults are ignored. If a single
|
||||
mode is specified, the program will run in that mode. If multiple
|
||||
modes are specified, the mode is selected as described in the "Per-CPU
|
||||
preferred tag checking modes" section below.
|
||||
|
||||
The current tag check fault mode can be read using the
|
||||
``prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0)`` system call.
|
||||
|
||||
@ -120,13 +126,39 @@ in the ``PR_MTE_TAG_MASK`` bit-field.
|
||||
interface provides an include mask. An include mask of ``0`` (exclusion
|
||||
mask ``0xffff``) results in the CPU always generating tag ``0``.
|
||||
|
||||
Per-CPU preferred tag checking mode
|
||||
-----------------------------------
|
||||
|
||||
On some CPUs the performance of MTE in stricter tag checking modes
|
||||
is similar to that of less strict tag checking modes. This makes it
|
||||
worthwhile to enable stricter checks on those CPUs when a less strict
|
||||
checking mode is requested, in order to gain the error detection
|
||||
benefits of the stricter checks without the performance downsides. To
|
||||
support this scenario, a privileged user may configure a stricter
|
||||
tag checking mode as the CPU's preferred tag checking mode.
|
||||
|
||||
The preferred tag checking mode for each CPU is controlled by
|
||||
``/sys/devices/system/cpu/cpu<N>/mte_tcf_preferred``, to which a
|
||||
privileged user may write the value ``async`` or ``sync``. The default
|
||||
preferred mode for each CPU is ``async``.
|
||||
|
||||
To allow a program to potentially run in the CPU's preferred tag
|
||||
checking mode, the user program may set multiple tag check fault mode
|
||||
bits in the ``flags`` argument to the ``prctl(PR_SET_TAGGED_ADDR_CTRL,
|
||||
flags, 0, 0, 0)`` system call. If the CPU's preferred tag checking
|
||||
mode is in the task's set of provided tag checking modes (this will
|
||||
always be the case at present because the kernel only supports two
|
||||
tag checking modes, but future kernels may support more modes), that
|
||||
mode will be selected. Otherwise, one of the modes in the task's mode
|
||||
set will be selected in a currently unspecified manner.
|
||||
|
||||
Initial process state
|
||||
---------------------
|
||||
|
||||
On ``execve()``, the new process has the following configuration:
|
||||
|
||||
- ``PR_TAGGED_ADDR_ENABLE`` set to 0 (disabled)
|
||||
- Tag checking mode set to ``PR_MTE_TCF_NONE``
|
||||
- No tag checking modes are selected (tag check faults ignored)
|
||||
- ``PR_MTE_TAG_MASK`` set to 0 (all tags excluded)
|
||||
- ``PSTATE.TCO`` set to 0
|
||||
- ``PROT_MTE`` not set on any of the initial memory maps
|
||||
@ -251,11 +283,13 @@ Example of correct usage
|
||||
return EXIT_FAILURE;
|
||||
|
||||
/*
|
||||
* Enable the tagged address ABI, synchronous MTE tag check faults and
|
||||
* allow all non-zero tags in the randomly generated set.
|
||||
* Enable the tagged address ABI, synchronous or asynchronous MTE
|
||||
* tag check faults (based on per-CPU preference) and allow all
|
||||
* non-zero tags in the randomly generated set.
|
||||
*/
|
||||
if (prctl(PR_SET_TAGGED_ADDR_CTRL,
|
||||
PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC | (0xfffe << PR_MTE_TAG_SHIFT),
|
||||
PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC | PR_MTE_TCF_ASYNC |
|
||||
(0xfffe << PR_MTE_TAG_SHIFT),
|
||||
0, 0, 0)) {
|
||||
perror("prctl() failed");
|
||||
return EXIT_FAILURE;
|
||||
|
@ -45,14 +45,24 @@ how the user addresses are used by the kernel:
|
||||
|
||||
1. User addresses not accessed by the kernel but used for address space
|
||||
management (e.g. ``mprotect()``, ``madvise()``). The use of valid
|
||||
tagged pointers in this context is allowed with the exception of
|
||||
``brk()``, ``mmap()`` and the ``new_address`` argument to
|
||||
``mremap()`` as these have the potential to alias with existing
|
||||
user addresses.
|
||||
tagged pointers in this context is allowed with these exceptions:
|
||||
|
||||
NOTE: This behaviour changed in v5.6 and so some earlier kernels may
|
||||
incorrectly accept valid tagged pointers for the ``brk()``,
|
||||
``mmap()`` and ``mremap()`` system calls.
|
||||
- ``brk()``, ``mmap()`` and the ``new_address`` argument to
|
||||
``mremap()`` as these have the potential to alias with existing
|
||||
user addresses.
|
||||
|
||||
NOTE: This behaviour changed in v5.6 and so some earlier kernels may
|
||||
incorrectly accept valid tagged pointers for the ``brk()``,
|
||||
``mmap()`` and ``mremap()`` system calls.
|
||||
|
||||
- The ``range.start``, ``start`` and ``dst`` arguments to the
|
||||
``UFFDIO_*`` ``ioctl()``s used on a file descriptor obtained from
|
||||
``userfaultfd()``, as fault addresses subsequently obtained by reading
|
||||
the file descriptor will be untagged, which may otherwise confuse
|
||||
tag-unaware programs.
|
||||
|
||||
NOTE: This behaviour changed in v5.14 and so some earlier kernels may
|
||||
incorrectly accept valid tagged pointers for this system call.
|
||||
|
||||
2. User addresses accessed by the kernel (e.g. ``write()``). This ABI
|
||||
relaxation is disabled by default and the application thread needs to
|
||||
|
@ -271,3 +271,97 @@ WRITE_ONCE. Thus:
|
||||
SC *y, t;
|
||||
|
||||
is allowed.
|
||||
|
||||
|
||||
CMPXCHG vs TRY_CMPXCHG
|
||||
----------------------
|
||||
|
||||
int atomic_cmpxchg(atomic_t *ptr, int old, int new);
|
||||
bool atomic_try_cmpxchg(atomic_t *ptr, int *oldp, int new);
|
||||
|
||||
Both provide the same functionality, but try_cmpxchg() can lead to more
|
||||
compact code. The functions relate like:
|
||||
|
||||
bool atomic_try_cmpxchg(atomic_t *ptr, int *oldp, int new)
|
||||
{
|
||||
int ret, old = *oldp;
|
||||
ret = atomic_cmpxchg(ptr, old, new);
|
||||
if (ret != old)
|
||||
*oldp = ret;
|
||||
return ret == old;
|
||||
}
|
||||
|
||||
and:
|
||||
|
||||
int atomic_cmpxchg(atomic_t *ptr, int old, int new)
|
||||
{
|
||||
(void)atomic_try_cmpxchg(ptr, &old, new);
|
||||
return old;
|
||||
}
|
||||
|
||||
Usage:
|
||||
|
||||
old = atomic_read(&v); old = atomic_read(&v);
|
||||
for (;;) { do {
|
||||
new = func(old); new = func(old);
|
||||
tmp = atomic_cmpxchg(&v, old, new); } while (!atomic_try_cmpxchg(&v, &old, new));
|
||||
if (tmp == old)
|
||||
break;
|
||||
old = tmp;
|
||||
}
|
||||
|
||||
NB. try_cmpxchg() also generates better code on some platforms (notably x86)
|
||||
where the function more closely matches the hardware instruction.
|
||||
|
||||
|
||||
FORWARD PROGRESS
|
||||
----------------
|
||||
|
||||
In general strong forward progress is expected of all unconditional atomic
|
||||
operations -- those in the Arithmetic and Bitwise classes and xchg(). However
|
||||
a fair amount of code also requires forward progress from the conditional
|
||||
atomic operations.
|
||||
|
||||
Specifically 'simple' cmpxchg() loops are expected to not starve one another
|
||||
indefinitely. However, this is not evident on LL/SC architectures, because
|
||||
while an LL/SC architecure 'can/should/must' provide forward progress
|
||||
guarantees between competing LL/SC sections, such a guarantee does not
|
||||
transfer to cmpxchg() implemented using LL/SC. Consider:
|
||||
|
||||
old = atomic_read(&v);
|
||||
do {
|
||||
new = func(old);
|
||||
} while (!atomic_try_cmpxchg(&v, &old, new));
|
||||
|
||||
which on LL/SC becomes something like:
|
||||
|
||||
old = atomic_read(&v);
|
||||
do {
|
||||
new = func(old);
|
||||
} while (!({
|
||||
volatile asm ("1: LL %[oldval], %[v]\n"
|
||||
" CMP %[oldval], %[old]\n"
|
||||
" BNE 2f\n"
|
||||
" SC %[new], %[v]\n"
|
||||
" BNE 1b\n"
|
||||
"2:\n"
|
||||
: [oldval] "=&r" (oldval), [v] "m" (v)
|
||||
: [old] "r" (old), [new] "r" (new)
|
||||
: "memory");
|
||||
success = (oldval == old);
|
||||
if (!success)
|
||||
old = oldval;
|
||||
success; }));
|
||||
|
||||
However, even the forward branch from the failed compare can cause the LL/SC
|
||||
to fail on some architectures, let alone whatever the compiler makes of the C
|
||||
loop body. As a result there is no guarantee what so ever the cacheline
|
||||
containing @v will stay on the local CPU and progress is made.
|
||||
|
||||
Even native CAS architectures can fail to provide forward progress for their
|
||||
primitive (See Sparc64 for an example).
|
||||
|
||||
Such implementations are strongly encouraged to add exponential backoff loops
|
||||
to a failed CAS in order to ensure some progress. Affected architectures are
|
||||
also strongly encouraged to inspect/audit the atomic fallbacks, refcount_t and
|
||||
their locking primitives.
|
||||
|
@ -54,7 +54,7 @@ layer or if we want to try to merge requests. In both cases, requests will be
|
||||
sent to the software queue.
|
||||
|
||||
Then, after the requests are processed by software queues, they will be placed
|
||||
at the hardware queue, a second stage queue were the hardware has direct access
|
||||
at the hardware queue, a second stage queue where the hardware has direct access
|
||||
to process those requests. However, if the hardware does not have enough
|
||||
resources to accept more requests, blk-mq will places requests on a temporary
|
||||
queue, to be sent in the future, when the hardware is able.
|
||||
|
@ -15,15 +15,7 @@ that goes into great technical depth about the BPF Architecture.
|
||||
libbpf
|
||||
======
|
||||
|
||||
Libbpf is a userspace library for loading and interacting with bpf programs.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
libbpf/libbpf
|
||||
libbpf/libbpf_api
|
||||
libbpf/libbpf_build
|
||||
libbpf/libbpf_naming_convention
|
||||
Documentation/bpf/libbpf/libbpf.rst is a userspace library for loading and interacting with bpf programs.
|
||||
|
||||
BPF Type Format (BTF)
|
||||
=====================
|
||||
|
22
Documentation/bpf/libbpf/index.rst
Normal file
22
Documentation/bpf/libbpf/index.rst
Normal file
@ -0,0 +1,22 @@
|
||||
.. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
|
||||
|
||||
libbpf
|
||||
======
|
||||
|
||||
For API documentation see the `versioned API documentation site <https://libbpf.readthedocs.io/en/latest/api.html>`_.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
libbpf_naming_convention
|
||||
libbpf_build
|
||||
|
||||
This is documentation for libbpf, a userspace library for loading and
|
||||
interacting with bpf programs.
|
||||
|
||||
All general BPF questions, including kernel functionality, libbpf APIs and
|
||||
their application, should be sent to bpf@vger.kernel.org mailing list.
|
||||
You can `subscribe <http://vger.kernel.org/vger-lists.html#bpf>`_ to the
|
||||
mailing list search its `archive <https://lore.kernel.org/bpf/>`_.
|
||||
Please search the archive before asking new questions. It very well might
|
||||
be that this was already addressed or answered before.
|
@ -1,14 +0,0 @@
|
||||
.. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
|
||||
|
||||
libbpf
|
||||
======
|
||||
|
||||
This is documentation for libbpf, a userspace library for loading and
|
||||
interacting with bpf programs.
|
||||
|
||||
All general BPF questions, including kernel functionality, libbpf APIs and
|
||||
their application, should be sent to bpf@vger.kernel.org mailing list.
|
||||
You can `subscribe <http://vger.kernel.org/vger-lists.html#bpf>`_ to the
|
||||
mailing list search its `archive <https://lore.kernel.org/bpf/>`_.
|
||||
Please search the archive before asking new questions. It very well might
|
||||
be that this was already addressed or answered before.
|
@ -1,27 +0,0 @@
|
||||
.. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
|
||||
|
||||
API
|
||||
===
|
||||
|
||||
This documentation is autogenerated from header files in libbpf, tools/lib/bpf
|
||||
|
||||
.. kernel-doc:: tools/lib/bpf/libbpf.h
|
||||
:internal:
|
||||
|
||||
.. kernel-doc:: tools/lib/bpf/bpf.h
|
||||
:internal:
|
||||
|
||||
.. kernel-doc:: tools/lib/bpf/btf.h
|
||||
:internal:
|
||||
|
||||
.. kernel-doc:: tools/lib/bpf/xsk.h
|
||||
:internal:
|
||||
|
||||
.. kernel-doc:: tools/lib/bpf/bpf_tracing.h
|
||||
:internal:
|
||||
|
||||
.. kernel-doc:: tools/lib/bpf/bpf_core_read.h
|
||||
:internal:
|
||||
|
||||
.. kernel-doc:: tools/lib/bpf/bpf_endian.h
|
||||
:internal:
|
@ -69,7 +69,7 @@ functions. These can be mixed and matched. Note that these functions
|
||||
are not reentrant for performance reasons.
|
||||
|
||||
ABI
|
||||
==========
|
||||
---
|
||||
|
||||
libbpf can be both linked statically or used as DSO. To avoid possible
|
||||
conflicts with other libraries an application is linked with, all
|
||||
@ -108,7 +108,7 @@ This bump in ABI version is at most once per kernel development cycle.
|
||||
|
||||
For example, if current state of ``libbpf.map`` is:
|
||||
|
||||
.. code-block:: c
|
||||
.. code-block:: none
|
||||
|
||||
LIBBPF_0.0.1 {
|
||||
global:
|
||||
@ -121,7 +121,7 @@ For example, if current state of ``libbpf.map`` is:
|
||||
, and a new symbol ``bpf_func_c`` is being introduced, then
|
||||
``libbpf.map`` should be changed like this:
|
||||
|
||||
.. code-block:: c
|
||||
.. code-block:: none
|
||||
|
||||
LIBBPF_0.0.1 {
|
||||
global:
|
||||
|
@ -16,8 +16,6 @@ import sys
|
||||
import os
|
||||
import sphinx
|
||||
|
||||
from subprocess import check_output
|
||||
|
||||
# Get Sphinx version
|
||||
major, minor, patch = sphinx.version_info[:3]
|
||||
|
||||
@ -343,6 +341,9 @@ latex_elements = {
|
||||
verbatimhintsturnover=false,
|
||||
''',
|
||||
|
||||
# For CJK One-half spacing, need to be in front of hyperref
|
||||
'extrapackages': r'\usepackage{setspace}',
|
||||
|
||||
# Additional stuff for the LaTeX preamble.
|
||||
'preamble': '''
|
||||
% Prevent column squeezing of tabulary.
|
||||
@ -355,29 +356,117 @@ latex_elements = {
|
||||
''',
|
||||
}
|
||||
|
||||
# At least one book (translations) may have Asian characters
|
||||
# with are only displayed if xeCJK is used
|
||||
# Translations have Asian (CJK) characters which are only displayed if
|
||||
# xeCJK is used
|
||||
|
||||
cjk_cmd = check_output(['fc-list', '--format="%{family[0]}\n"']).decode('utf-8', 'ignore')
|
||||
if cjk_cmd.find("Noto Sans CJK SC") >= 0:
|
||||
latex_elements['preamble'] += '''
|
||||
latex_elements['preamble'] += '''
|
||||
\\IfFontExistsTF{Noto Sans CJK SC}{
|
||||
% This is needed for translations
|
||||
\\usepackage{xeCJK}
|
||||
\\setCJKmainfont{Noto Sans CJK SC}
|
||||
\\usepackage{xeCJK}
|
||||
\\IfFontExistsTF{Noto Serif CJK SC}{
|
||||
\\setCJKmainfont{Noto Serif CJK SC}[AutoFakeSlant]
|
||||
}{
|
||||
\\setCJKmainfont{Noto Sans CJK SC}[AutoFakeSlant]
|
||||
}
|
||||
\\setCJKsansfont{Noto Sans CJK SC}[AutoFakeSlant]
|
||||
\\setCJKmonofont{Noto Sans Mono CJK SC}[AutoFakeSlant]
|
||||
% CJK Language-specific font choices
|
||||
\\IfFontExistsTF{Noto Serif CJK SC}{
|
||||
\\newCJKfontfamily[SCmain]\\scmain{Noto Serif CJK SC}[AutoFakeSlant]
|
||||
\\newCJKfontfamily[SCserif]\\scserif{Noto Serif CJK SC}[AutoFakeSlant]
|
||||
}{
|
||||
\\newCJKfontfamily[SCmain]\\scmain{Noto Sans CJK SC}[AutoFakeSlant]
|
||||
\\newCJKfontfamily[SCserif]\\scserif{Noto Sans CJK SC}[AutoFakeSlant]
|
||||
}
|
||||
\\newCJKfontfamily[SCsans]\\scsans{Noto Sans CJK SC}[AutoFakeSlant]
|
||||
\\newCJKfontfamily[SCmono]\\scmono{Noto Sans Mono CJK SC}[AutoFakeSlant]
|
||||
\\IfFontExistsTF{Noto Serif CJK TC}{
|
||||
\\newCJKfontfamily[TCmain]\\tcmain{Noto Serif CJK TC}[AutoFakeSlant]
|
||||
\\newCJKfontfamily[TCserif]\\tcserif{Noto Serif CJK TC}[AutoFakeSlant]
|
||||
}{
|
||||
\\newCJKfontfamily[TCmain]\\tcmain{Noto Sans CJK TC}[AutoFakeSlant]
|
||||
\\newCJKfontfamily[TCserif]\\tcserif{Noto Sans CJK TC}[AutoFakeSlant]
|
||||
}
|
||||
\\newCJKfontfamily[TCsans]\\tcsans{Noto Sans CJK TC}[AutoFakeSlant]
|
||||
\\newCJKfontfamily[TCmono]\\tcmono{Noto Sans Mono CJK TC}[AutoFakeSlant]
|
||||
\\IfFontExistsTF{Noto Serif CJK KR}{
|
||||
\\newCJKfontfamily[KRmain]\\krmain{Noto Serif CJK KR}[AutoFakeSlant]
|
||||
\\newCJKfontfamily[KRserif]\\krserif{Noto Serif CJK KR}[AutoFakeSlant]
|
||||
}{
|
||||
\\newCJKfontfamily[KRmain]\\krmain{Noto Sans CJK KR}[AutoFakeSlant]
|
||||
\\newCJKfontfamily[KRserif]\\krserif{Noto Sans CJK KR}[AutoFakeSlant]
|
||||
}
|
||||
\\newCJKfontfamily[KRsans]\\krsans{Noto Sans CJK KR}[AutoFakeSlant]
|
||||
\\newCJKfontfamily[KRmono]\\krmono{Noto Sans Mono CJK KR}[AutoFakeSlant]
|
||||
\\IfFontExistsTF{Noto Serif CJK JP}{
|
||||
\\newCJKfontfamily[JPmain]\\jpmain{Noto Serif CJK JP}[AutoFakeSlant]
|
||||
\\newCJKfontfamily[JPserif]\\jpserif{Noto Serif CJK JP}[AutoFakeSlant]
|
||||
}{
|
||||
\\newCJKfontfamily[JPmain]\\jpmain{Noto Sans CJK JP}[AutoFakeSlant]
|
||||
\\newCJKfontfamily[JPserif]\\jpserif{Noto Sans CJK JP}[AutoFakeSlant]
|
||||
}
|
||||
\\newCJKfontfamily[JPsans]\\jpsans{Noto Sans CJK JP}[AutoFakeSlant]
|
||||
\\newCJKfontfamily[JPmono]\\jpmono{Noto Sans Mono CJK JP}[AutoFakeSlant]
|
||||
% Dummy commands for Sphinx < 2.3 (no 'extrapackages' support)
|
||||
\\providecommand{\\onehalfspacing}{}
|
||||
\\providecommand{\\singlespacing}{}
|
||||
% Define custom macros to on/off CJK
|
||||
\\newcommand{\\kerneldocCJKon}{\\makexeCJKactive}
|
||||
\\newcommand{\\kerneldocCJKoff}{\\makexeCJKinactive}
|
||||
% To customize \sphinxtableofcontents
|
||||
\\newcommand{\\kerneldocCJKon}{\\makexeCJKactive\\onehalfspacing}
|
||||
\\newcommand{\\kerneldocCJKoff}{\\makexeCJKinactive\\singlespacing}
|
||||
\\newcommand{\\kerneldocBeginSC}{%
|
||||
\\begingroup%
|
||||
\\scmain%
|
||||
}
|
||||
\\newcommand{\\kerneldocEndSC}{\\endgroup}
|
||||
\\newcommand{\\kerneldocBeginTC}{%
|
||||
\\begingroup%
|
||||
\\tcmain%
|
||||
\\renewcommand{\\CJKrmdefault}{TCserif}%
|
||||
\\renewcommand{\\CJKsfdefault}{TCsans}%
|
||||
\\renewcommand{\\CJKttdefault}{TCmono}%
|
||||
}
|
||||
\\newcommand{\\kerneldocEndTC}{\\endgroup}
|
||||
\\newcommand{\\kerneldocBeginKR}{%
|
||||
\\begingroup%
|
||||
\\xeCJKDeclareCharClass{HalfLeft}{`“,`‘}%
|
||||
\\xeCJKDeclareCharClass{HalfRight}{`”,`’}%
|
||||
\\krmain%
|
||||
\\renewcommand{\\CJKrmdefault}{KRserif}%
|
||||
\\renewcommand{\\CJKsfdefault}{KRsans}%
|
||||
\\renewcommand{\\CJKttdefault}{KRmono}%
|
||||
\\xeCJKsetup{CJKspace = true} % For inter-phrase space
|
||||
}
|
||||
\\newcommand{\\kerneldocEndKR}{\\endgroup}
|
||||
\\newcommand{\\kerneldocBeginJP}{%
|
||||
\\begingroup%
|
||||
\\xeCJKDeclareCharClass{HalfLeft}{`“,`‘}%
|
||||
\\xeCJKDeclareCharClass{HalfRight}{`”,`’}%
|
||||
\\jpmain%
|
||||
\\renewcommand{\\CJKrmdefault}{JPserif}%
|
||||
\\renewcommand{\\CJKsfdefault}{JPsans}%
|
||||
\\renewcommand{\\CJKttdefault}{JPmono}%
|
||||
}
|
||||
\\newcommand{\\kerneldocEndJP}{\\endgroup}
|
||||
% Single spacing in literal blocks
|
||||
\\fvset{baselinestretch=1}
|
||||
% To customize \\sphinxtableofcontents
|
||||
\\usepackage{etoolbox}
|
||||
% Inactivate CJK after tableofcontents
|
||||
\\apptocmd{\\sphinxtableofcontents}{\\kerneldocCJKoff}{}{}
|
||||
'''
|
||||
else:
|
||||
latex_elements['preamble'] += '''
|
||||
}{ % No CJK font found
|
||||
% Custom macros to on/off CJK (Dummy)
|
||||
\\newcommand{\\kerneldocCJKon}{}
|
||||
\\newcommand{\\kerneldocCJKoff}{}
|
||||
'''
|
||||
\\newcommand{\\kerneldocBeginSC}{}
|
||||
\\newcommand{\\kerneldocEndSC}{}
|
||||
\\newcommand{\\kerneldocBeginTC}{}
|
||||
\\newcommand{\\kerneldocEndTC}{}
|
||||
\\newcommand{\\kerneldocBeginKR}{}
|
||||
\\newcommand{\\kerneldocEndKR}{}
|
||||
\\newcommand{\\kerneldocBeginJP}{}
|
||||
\\newcommand{\\kerneldocEndJP}{}
|
||||
}
|
||||
'''
|
||||
|
||||
# Fix reference escape troubles with Sphinx 1.4.x
|
||||
if major == 1:
|
||||
|
@ -271,10 +271,15 @@ maps this page at its virtual address.
|
||||
|
||||
``void flush_dcache_page(struct page *page)``
|
||||
|
||||
Any time the kernel writes to a page cache page, _OR_
|
||||
the kernel is about to read from a page cache page and
|
||||
user space shared/writable mappings of this page potentially
|
||||
exist, this routine is called.
|
||||
This routines must be called when:
|
||||
|
||||
a) the kernel did write to a page that is in the page cache page
|
||||
and / or in high memory
|
||||
b) the kernel is about to read from a page cache page and user space
|
||||
shared/writable mappings of this page potentially exist. Note
|
||||
that {get,pin}_user_pages{_fast} already call flush_dcache_page
|
||||
on any page found in the user address space and thus driver
|
||||
code rarely needs to take this into account.
|
||||
|
||||
.. note::
|
||||
|
||||
@ -284,38 +289,34 @@ maps this page at its virtual address.
|
||||
handling vfs symlinks in the page cache need not call
|
||||
this interface at all.
|
||||
|
||||
The phrase "kernel writes to a page cache page" means,
|
||||
specifically, that the kernel executes store instructions
|
||||
that dirty data in that page at the page->virtual mapping
|
||||
of that page. It is important to flush here to handle
|
||||
D-cache aliasing, to make sure these kernel stores are
|
||||
visible to user space mappings of that page.
|
||||
The phrase "kernel writes to a page cache page" means, specifically,
|
||||
that the kernel executes store instructions that dirty data in that
|
||||
page at the page->virtual mapping of that page. It is important to
|
||||
flush here to handle D-cache aliasing, to make sure these kernel stores
|
||||
are visible to user space mappings of that page.
|
||||
|
||||
The corollary case is just as important, if there are users
|
||||
which have shared+writable mappings of this file, we must make
|
||||
sure that kernel reads of these pages will see the most recent
|
||||
stores done by the user.
|
||||
The corollary case is just as important, if there are users which have
|
||||
shared+writable mappings of this file, we must make sure that kernel
|
||||
reads of these pages will see the most recent stores done by the user.
|
||||
|
||||
If D-cache aliasing is not an issue, this routine may
|
||||
simply be defined as a nop on that architecture.
|
||||
If D-cache aliasing is not an issue, this routine may simply be defined
|
||||
as a nop on that architecture.
|
||||
|
||||
There is a bit set aside in page->flags (PG_arch_1) as
|
||||
"architecture private". The kernel guarantees that,
|
||||
for pagecache pages, it will clear this bit when such
|
||||
a page first enters the pagecache.
|
||||
There is a bit set aside in page->flags (PG_arch_1) as "architecture
|
||||
private". The kernel guarantees that, for pagecache pages, it will
|
||||
clear this bit when such a page first enters the pagecache.
|
||||
|
||||
This allows these interfaces to be implemented much more
|
||||
efficiently. It allows one to "defer" (perhaps indefinitely)
|
||||
the actual flush if there are currently no user processes
|
||||
mapping this page. See sparc64's flush_dcache_page and
|
||||
update_mmu_cache implementations for an example of how to go
|
||||
about doing this.
|
||||
This allows these interfaces to be implemented much more efficiently.
|
||||
It allows one to "defer" (perhaps indefinitely) the actual flush if
|
||||
there are currently no user processes mapping this page. See sparc64's
|
||||
flush_dcache_page and update_mmu_cache implementations for an example
|
||||
of how to go about doing this.
|
||||
|
||||
The idea is, first at flush_dcache_page() time, if
|
||||
page->mapping->i_mmap is an empty tree, just mark the architecture
|
||||
private page flag bit. Later, in update_mmu_cache(), a check is
|
||||
made of this flag bit, and if set the flush is done and the flag
|
||||
bit is cleared.
|
||||
The idea is, first at flush_dcache_page() time, if page_file_mapping()
|
||||
returns a mapping, and mapping_mapped on that mapping returns %false,
|
||||
just mark the architecture private page flag bit. Later, in
|
||||
update_mmu_cache(), a check is made of this flag bit, and if set the
|
||||
flush is done and the flag bit is cleared.
|
||||
|
||||
.. important::
|
||||
|
||||
@ -351,19 +352,6 @@ maps this page at its virtual address.
|
||||
architectures). For incoherent architectures, it should flush
|
||||
the cache of the page at vmaddr.
|
||||
|
||||
``void flush_kernel_dcache_page(struct page *page)``
|
||||
|
||||
When the kernel needs to modify a user page is has obtained
|
||||
with kmap, it calls this function after all modifications are
|
||||
complete (but before kunmapping it) to bring the underlying
|
||||
page up to date. It is assumed here that the user has no
|
||||
incoherent cached copies (i.e. the original page was obtained
|
||||
from a mechanism like get_user_pages()). The default
|
||||
implementation is a nop and should remain so on all coherent
|
||||
architectures. On incoherent architectures, this should flush
|
||||
the kernel cache for page (using page_address(page)).
|
||||
|
||||
|
||||
``void flush_icache_range(unsigned long start, unsigned long end)``
|
||||
|
||||
When the kernel stores into addresses that it will execute
|
||||
|
@ -2,12 +2,13 @@
|
||||
CPU hotplug in the Kernel
|
||||
=========================
|
||||
|
||||
:Date: December, 2016
|
||||
:Date: September, 2021
|
||||
:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
|
||||
Rusty Russell <rusty@rustcorp.com.au>,
|
||||
Srivatsa Vaddagiri <vatsa@in.ibm.com>,
|
||||
Ashok Raj <ashok.raj@intel.com>,
|
||||
Joel Schopp <jschopp@austin.ibm.com>
|
||||
Rusty Russell <rusty@rustcorp.com.au>,
|
||||
Srivatsa Vaddagiri <vatsa@in.ibm.com>,
|
||||
Ashok Raj <ashok.raj@intel.com>,
|
||||
Joel Schopp <jschopp@austin.ibm.com>,
|
||||
Thomas Gleixner <tglx@linutronix.de>
|
||||
|
||||
Introduction
|
||||
============
|
||||
@ -91,9 +92,10 @@ Never use anything other than ``cpumask_t`` to represent bitmap of CPUs.
|
||||
|
||||
Using CPU hotplug
|
||||
=================
|
||||
|
||||
The kernel option *CONFIG_HOTPLUG_CPU* needs to be enabled. It is currently
|
||||
available on multiple architectures including ARM, MIPS, PowerPC and X86. The
|
||||
configuration is done via the sysfs interface: ::
|
||||
configuration is done via the sysfs interface::
|
||||
|
||||
$ ls -lh /sys/devices/system/cpu
|
||||
total 0
|
||||
@ -113,14 +115,14 @@ configuration is done via the sysfs interface: ::
|
||||
|
||||
The files *offline*, *online*, *possible*, *present* represent the CPU masks.
|
||||
Each CPU folder contains an *online* file which controls the logical on (1) and
|
||||
off (0) state. To logically shutdown CPU4: ::
|
||||
off (0) state. To logically shutdown CPU4::
|
||||
|
||||
$ echo 0 > /sys/devices/system/cpu/cpu4/online
|
||||
smpboot: CPU 4 is now offline
|
||||
|
||||
Once the CPU is shutdown, it will be removed from */proc/interrupts*,
|
||||
*/proc/cpuinfo* and should also not be shown visible by the *top* command. To
|
||||
bring CPU4 back online: ::
|
||||
bring CPU4 back online::
|
||||
|
||||
$ echo 1 > /sys/devices/system/cpu/cpu4/online
|
||||
smpboot: Booting Node 0 Processor 4 APIC 0x1
|
||||
@ -142,6 +144,7 @@ The CPU hotplug coordination
|
||||
|
||||
The offline case
|
||||
----------------
|
||||
|
||||
Once a CPU has been logically shutdown the teardown callbacks of registered
|
||||
hotplug states will be invoked, starting with ``CPUHP_ONLINE`` and terminating
|
||||
at state ``CPUHP_OFFLINE``. This includes:
|
||||
@ -156,105 +159,491 @@ at state ``CPUHP_OFFLINE``. This includes:
|
||||
* Once all services are migrated, kernel calls an arch specific routine
|
||||
``__cpu_disable()`` to perform arch specific cleanup.
|
||||
|
||||
Using the hotplug API
|
||||
---------------------
|
||||
It is possible to receive notifications once a CPU is offline or onlined. This
|
||||
might be important to certain drivers which need to perform some kind of setup
|
||||
or clean up functions based on the number of available CPUs: ::
|
||||
|
||||
#include <linux/cpuhotplug.h>
|
||||
The CPU hotplug API
|
||||
===================
|
||||
|
||||
ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "X/Y:online",
|
||||
Y_online, Y_prepare_down);
|
||||
CPU hotplug state machine
|
||||
-------------------------
|
||||
|
||||
*X* is the subsystem and *Y* the particular driver. The *Y_online* callback
|
||||
will be invoked during registration on all online CPUs. If an error
|
||||
occurs during the online callback the *Y_prepare_down* callback will be
|
||||
invoked on all CPUs on which the online callback was previously invoked.
|
||||
After registration completed, the *Y_online* callback will be invoked
|
||||
once a CPU is brought online and *Y_prepare_down* will be invoked when a
|
||||
CPU is shutdown. All resources which were previously allocated in
|
||||
*Y_online* should be released in *Y_prepare_down*.
|
||||
The return value *ret* is negative if an error occurred during the
|
||||
registration process. Otherwise a positive value is returned which
|
||||
contains the allocated hotplug for dynamically allocated states
|
||||
(*CPUHP_AP_ONLINE_DYN*). It will return zero for predefined states.
|
||||
CPU hotplug uses a trivial state machine with a linear state space from
|
||||
CPUHP_OFFLINE to CPUHP_ONLINE. Each state has a startup and a teardown
|
||||
callback.
|
||||
|
||||
The callback can be remove by invoking ``cpuhp_remove_state()``. In case of a
|
||||
dynamically allocated state (*CPUHP_AP_ONLINE_DYN*) use the returned state.
|
||||
During the removal of a hotplug state the teardown callback will be invoked.
|
||||
When a CPU is onlined, the startup callbacks are invoked sequentially until
|
||||
the state CPUHP_ONLINE is reached. They can also be invoked when the
|
||||
callbacks of a state are set up or an instance is added to a multi-instance
|
||||
state.
|
||||
|
||||
Multiple instances
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
If a driver has multiple instances and each instance needs to perform the
|
||||
callback independently then it is likely that a ''multi-state'' should be used.
|
||||
First a multi-state state needs to be registered: ::
|
||||
When a CPU is offlined the teardown callbacks are invoked in the reverse
|
||||
order sequentially until the state CPUHP_OFFLINE is reached. They can also
|
||||
be invoked when the callbacks of a state are removed or an instance is
|
||||
removed from a multi-instance state.
|
||||
|
||||
ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "X/Y:online,
|
||||
Y_online, Y_prepare_down);
|
||||
Y_hp_online = ret;
|
||||
If a usage site requires only a callback in one direction of the hotplug
|
||||
operations (CPU online or CPU offline) then the other not-required callback
|
||||
can be set to NULL when the state is set up.
|
||||
|
||||
The ``cpuhp_setup_state_multi()`` behaves similar to ``cpuhp_setup_state()``
|
||||
except it prepares the callbacks for a multi state and does not invoke
|
||||
the callbacks. This is a one time setup.
|
||||
Once a new instance is allocated, you need to register this new instance: ::
|
||||
The state space is divided into three sections:
|
||||
|
||||
ret = cpuhp_state_add_instance(Y_hp_online, &d->node);
|
||||
* The PREPARE section
|
||||
|
||||
This function will add this instance to your previously allocated
|
||||
*Y_hp_online* state and invoke the previously registered callback
|
||||
(*Y_online*) on all online CPUs. The *node* element is a ``struct
|
||||
hlist_node`` member of your per-instance data structure.
|
||||
The PREPARE section covers the state space from CPUHP_OFFLINE to
|
||||
CPUHP_BRINGUP_CPU.
|
||||
|
||||
On removal of the instance: ::
|
||||
cpuhp_state_remove_instance(Y_hp_online, &d->node)
|
||||
The startup callbacks in this section are invoked before the CPU is
|
||||
started during a CPU online operation. The teardown callbacks are invoked
|
||||
after the CPU has become dysfunctional during a CPU offline operation.
|
||||
|
||||
should be invoked which will invoke the teardown callback on all online
|
||||
CPUs.
|
||||
The callbacks are invoked on a control CPU as they can't obviously run on
|
||||
the hotplugged CPU which is either not yet started or has become
|
||||
dysfunctional already.
|
||||
|
||||
Manual setup
|
||||
~~~~~~~~~~~~
|
||||
Usually it is handy to invoke setup and teardown callbacks on registration or
|
||||
removal of a state because usually the operation needs to performed once a CPU
|
||||
goes online (offline) and during initial setup (shutdown) of the driver. However
|
||||
each registration and removal function is also available with a ``_nocalls``
|
||||
suffix which does not invoke the provided callbacks if the invocation of the
|
||||
callbacks is not desired. During the manual setup (or teardown) the functions
|
||||
``get_online_cpus()`` and ``put_online_cpus()`` should be used to inhibit CPU
|
||||
hotplug operations.
|
||||
The startup callbacks are used to setup resources which are required to
|
||||
bring a CPU successfully online. The teardown callbacks are used to free
|
||||
resources or to move pending work to an online CPU after the hotplugged
|
||||
CPU became dysfunctional.
|
||||
|
||||
The startup callbacks are allowed to fail. If a callback fails, the CPU
|
||||
online operation is aborted and the CPU is brought down to the previous
|
||||
state (usually CPUHP_OFFLINE) again.
|
||||
|
||||
The ordering of the events
|
||||
--------------------------
|
||||
The hotplug states are defined in ``include/linux/cpuhotplug.h``:
|
||||
The teardown callbacks in this section are not allowed to fail.
|
||||
|
||||
* The states *CPUHP_OFFLINE* … *CPUHP_AP_OFFLINE* are invoked before the
|
||||
CPU is up.
|
||||
* The states *CPUHP_AP_OFFLINE* … *CPUHP_AP_ONLINE* are invoked
|
||||
just the after the CPU has been brought up. The interrupts are off and
|
||||
the scheduler is not yet active on this CPU. Starting with *CPUHP_AP_OFFLINE*
|
||||
the callbacks are invoked on the target CPU.
|
||||
* The states between *CPUHP_AP_ONLINE_DYN* and *CPUHP_AP_ONLINE_DYN_END* are
|
||||
reserved for the dynamic allocation.
|
||||
* The states are invoked in the reverse order on CPU shutdown starting with
|
||||
*CPUHP_ONLINE* and stopping at *CPUHP_OFFLINE*. Here the callbacks are
|
||||
invoked on the CPU that will be shutdown until *CPUHP_AP_OFFLINE*.
|
||||
* The STARTING section
|
||||
|
||||
The STARTING section covers the state space between CPUHP_BRINGUP_CPU + 1
|
||||
and CPUHP_AP_ONLINE.
|
||||
|
||||
The startup callbacks in this section are invoked on the hotplugged CPU
|
||||
with interrupts disabled during a CPU online operation in the early CPU
|
||||
setup code. The teardown callbacks are invoked with interrupts disabled
|
||||
on the hotplugged CPU during a CPU offline operation shortly before the
|
||||
CPU is completely shut down.
|
||||
|
||||
The callbacks in this section are not allowed to fail.
|
||||
|
||||
The callbacks are used for low level hardware initialization/shutdown and
|
||||
for core subsystems.
|
||||
|
||||
* The ONLINE section
|
||||
|
||||
The ONLINE section covers the state space between CPUHP_AP_ONLINE + 1 and
|
||||
CPUHP_ONLINE.
|
||||
|
||||
The startup callbacks in this section are invoked on the hotplugged CPU
|
||||
during a CPU online operation. The teardown callbacks are invoked on the
|
||||
hotplugged CPU during a CPU offline operation.
|
||||
|
||||
The callbacks are invoked in the context of the per CPU hotplug thread,
|
||||
which is pinned on the hotplugged CPU. The callbacks are invoked with
|
||||
interrupts and preemption enabled.
|
||||
|
||||
The callbacks are allowed to fail. When a callback fails the hotplug
|
||||
operation is aborted and the CPU is brought back to the previous state.
|
||||
|
||||
CPU online/offline operations
|
||||
-----------------------------
|
||||
|
||||
A successful online operation looks like this::
|
||||
|
||||
[CPUHP_OFFLINE]
|
||||
[CPUHP_OFFLINE + 1]->startup() -> success
|
||||
[CPUHP_OFFLINE + 2]->startup() -> success
|
||||
[CPUHP_OFFLINE + 3] -> skipped because startup == NULL
|
||||
...
|
||||
[CPUHP_BRINGUP_CPU]->startup() -> success
|
||||
=== End of PREPARE section
|
||||
[CPUHP_BRINGUP_CPU + 1]->startup() -> success
|
||||
...
|
||||
[CPUHP_AP_ONLINE]->startup() -> success
|
||||
=== End of STARTUP section
|
||||
[CPUHP_AP_ONLINE + 1]->startup() -> success
|
||||
...
|
||||
[CPUHP_ONLINE - 1]->startup() -> success
|
||||
[CPUHP_ONLINE]
|
||||
|
||||
A successful offline operation looks like this::
|
||||
|
||||
[CPUHP_ONLINE]
|
||||
[CPUHP_ONLINE - 1]->teardown() -> success
|
||||
...
|
||||
[CPUHP_AP_ONLINE + 1]->teardown() -> success
|
||||
=== Start of STARTUP section
|
||||
[CPUHP_AP_ONLINE]->teardown() -> success
|
||||
...
|
||||
[CPUHP_BRINGUP_ONLINE - 1]->teardown()
|
||||
...
|
||||
=== Start of PREPARE section
|
||||
[CPUHP_BRINGUP_CPU]->teardown()
|
||||
[CPUHP_OFFLINE + 3]->teardown()
|
||||
[CPUHP_OFFLINE + 2] -> skipped because teardown == NULL
|
||||
[CPUHP_OFFLINE + 1]->teardown()
|
||||
[CPUHP_OFFLINE]
|
||||
|
||||
A failed online operation looks like this::
|
||||
|
||||
[CPUHP_OFFLINE]
|
||||
[CPUHP_OFFLINE + 1]->startup() -> success
|
||||
[CPUHP_OFFLINE + 2]->startup() -> success
|
||||
[CPUHP_OFFLINE + 3] -> skipped because startup == NULL
|
||||
...
|
||||
[CPUHP_BRINGUP_CPU]->startup() -> success
|
||||
=== End of PREPARE section
|
||||
[CPUHP_BRINGUP_CPU + 1]->startup() -> success
|
||||
...
|
||||
[CPUHP_AP_ONLINE]->startup() -> success
|
||||
=== End of STARTUP section
|
||||
[CPUHP_AP_ONLINE + 1]->startup() -> success
|
||||
---
|
||||
[CPUHP_AP_ONLINE + N]->startup() -> fail
|
||||
[CPUHP_AP_ONLINE + (N - 1)]->teardown()
|
||||
...
|
||||
[CPUHP_AP_ONLINE + 1]->teardown()
|
||||
=== Start of STARTUP section
|
||||
[CPUHP_AP_ONLINE]->teardown()
|
||||
...
|
||||
[CPUHP_BRINGUP_ONLINE - 1]->teardown()
|
||||
...
|
||||
=== Start of PREPARE section
|
||||
[CPUHP_BRINGUP_CPU]->teardown()
|
||||
[CPUHP_OFFLINE + 3]->teardown()
|
||||
[CPUHP_OFFLINE + 2] -> skipped because teardown == NULL
|
||||
[CPUHP_OFFLINE + 1]->teardown()
|
||||
[CPUHP_OFFLINE]
|
||||
|
||||
A failed offline operation looks like this::
|
||||
|
||||
[CPUHP_ONLINE]
|
||||
[CPUHP_ONLINE - 1]->teardown() -> success
|
||||
...
|
||||
[CPUHP_ONLINE - N]->teardown() -> fail
|
||||
[CPUHP_ONLINE - (N - 1)]->startup()
|
||||
...
|
||||
[CPUHP_ONLINE - 1]->startup()
|
||||
[CPUHP_ONLINE]
|
||||
|
||||
Recursive failures cannot be handled sensibly. Look at the following
|
||||
example of a recursive fail due to a failed offline operation: ::
|
||||
|
||||
[CPUHP_ONLINE]
|
||||
[CPUHP_ONLINE - 1]->teardown() -> success
|
||||
...
|
||||
[CPUHP_ONLINE - N]->teardown() -> fail
|
||||
[CPUHP_ONLINE - (N - 1)]->startup() -> success
|
||||
[CPUHP_ONLINE - (N - 2)]->startup() -> fail
|
||||
|
||||
The CPU hotplug state machine stops right here and does not try to go back
|
||||
down again because that would likely result in an endless loop::
|
||||
|
||||
[CPUHP_ONLINE - (N - 1)]->teardown() -> success
|
||||
[CPUHP_ONLINE - N]->teardown() -> fail
|
||||
[CPUHP_ONLINE - (N - 1)]->startup() -> success
|
||||
[CPUHP_ONLINE - (N - 2)]->startup() -> fail
|
||||
[CPUHP_ONLINE - (N - 1)]->teardown() -> success
|
||||
[CPUHP_ONLINE - N]->teardown() -> fail
|
||||
|
||||
Lather, rinse and repeat. In this case the CPU left in state::
|
||||
|
||||
[CPUHP_ONLINE - (N - 1)]
|
||||
|
||||
which at least lets the system make progress and gives the user a chance to
|
||||
debug or even resolve the situation.
|
||||
|
||||
Allocating a state
|
||||
------------------
|
||||
|
||||
There are two ways to allocate a CPU hotplug state:
|
||||
|
||||
* Static allocation
|
||||
|
||||
Static allocation has to be used when the subsystem or driver has
|
||||
ordering requirements versus other CPU hotplug states. E.g. the PERF core
|
||||
startup callback has to be invoked before the PERF driver startup
|
||||
callbacks during a CPU online operation. During a CPU offline operation
|
||||
the driver teardown callbacks have to be invoked before the core teardown
|
||||
callback. The statically allocated states are described by constants in
|
||||
the cpuhp_state enum which can be found in include/linux/cpuhotplug.h.
|
||||
|
||||
Insert the state into the enum at the proper place so the ordering
|
||||
requirements are fulfilled. The state constant has to be used for state
|
||||
setup and removal.
|
||||
|
||||
Static allocation is also required when the state callbacks are not set
|
||||
up at runtime and are part of the initializer of the CPU hotplug state
|
||||
array in kernel/cpu.c.
|
||||
|
||||
* Dynamic allocation
|
||||
|
||||
When there are no ordering requirements for the state callbacks then
|
||||
dynamic allocation is the preferred method. The state number is allocated
|
||||
by the setup function and returned to the caller on success.
|
||||
|
||||
Only the PREPARE and ONLINE sections provide a dynamic allocation
|
||||
range. The STARTING section does not as most of the callbacks in that
|
||||
section have explicit ordering requirements.
|
||||
|
||||
Setup of a CPU hotplug state
|
||||
----------------------------
|
||||
|
||||
The core code provides the following functions to setup a state:
|
||||
|
||||
* cpuhp_setup_state(state, name, startup, teardown)
|
||||
* cpuhp_setup_state_nocalls(state, name, startup, teardown)
|
||||
* cpuhp_setup_state_cpuslocked(state, name, startup, teardown)
|
||||
* cpuhp_setup_state_nocalls_cpuslocked(state, name, startup, teardown)
|
||||
|
||||
For cases where a driver or a subsystem has multiple instances and the same
|
||||
CPU hotplug state callbacks need to be invoked for each instance, the CPU
|
||||
hotplug core provides multi-instance support. The advantage over driver
|
||||
specific instance lists is that the instance related functions are fully
|
||||
serialized against CPU hotplug operations and provide the automatic
|
||||
invocations of the state callbacks on add and removal. To set up such a
|
||||
multi-instance state the following function is available:
|
||||
|
||||
* cpuhp_setup_state_multi(state, name, startup, teardown)
|
||||
|
||||
The @state argument is either a statically allocated state or one of the
|
||||
constants for dynamically allocated states - CPUHP_PREPARE_DYN,
|
||||
CPUHP_ONLINE_DYN - depending on the state section (PREPARE, ONLINE) for
|
||||
which a dynamic state should be allocated.
|
||||
|
||||
The @name argument is used for sysfs output and for instrumentation. The
|
||||
naming convention is "subsys:mode" or "subsys/driver:mode",
|
||||
e.g. "perf:mode" or "perf/x86:mode". The common mode names are:
|
||||
|
||||
======== =======================================================
|
||||
prepare For states in the PREPARE section
|
||||
|
||||
dead For states in the PREPARE section which do not provide
|
||||
a startup callback
|
||||
|
||||
starting For states in the STARTING section
|
||||
|
||||
dying For states in the STARTING section which do not provide
|
||||
a startup callback
|
||||
|
||||
online For states in the ONLINE section
|
||||
|
||||
offline For states in the ONLINE section which do not provide
|
||||
a startup callback
|
||||
======== =======================================================
|
||||
|
||||
As the @name argument is only used for sysfs and instrumentation other mode
|
||||
descriptors can be used as well if they describe the nature of the state
|
||||
better than the common ones.
|
||||
|
||||
Examples for @name arguments: "perf/online", "perf/x86:prepare",
|
||||
"RCU/tree:dying", "sched/waitempty"
|
||||
|
||||
The @startup argument is a function pointer to the callback which should be
|
||||
invoked during a CPU online operation. If the usage site does not require a
|
||||
startup callback set the pointer to NULL.
|
||||
|
||||
The @teardown argument is a function pointer to the callback which should
|
||||
be invoked during a CPU offline operation. If the usage site does not
|
||||
require a teardown callback set the pointer to NULL.
|
||||
|
||||
The functions differ in the way how the installed callbacks are treated:
|
||||
|
||||
* cpuhp_setup_state_nocalls(), cpuhp_setup_state_nocalls_cpuslocked()
|
||||
and cpuhp_setup_state_multi() only install the callbacks
|
||||
|
||||
* cpuhp_setup_state() and cpuhp_setup_state_cpuslocked() install the
|
||||
callbacks and invoke the @startup callback (if not NULL) for all online
|
||||
CPUs which have currently a state greater than the newly installed
|
||||
state. Depending on the state section the callback is either invoked on
|
||||
the current CPU (PREPARE section) or on each online CPU (ONLINE
|
||||
section) in the context of the CPU's hotplug thread.
|
||||
|
||||
If a callback fails for CPU N then the teardown callback for CPU
|
||||
0 .. N-1 is invoked to rollback the operation. The state setup fails,
|
||||
the callbacks for the state are not installed and in case of dynamic
|
||||
allocation the allocated state is freed.
|
||||
|
||||
The state setup and the callback invocations are serialized against CPU
|
||||
hotplug operations. If the setup function has to be called from a CPU
|
||||
hotplug read locked region, then the _cpuslocked() variants have to be
|
||||
used. These functions cannot be used from within CPU hotplug callbacks.
|
||||
|
||||
The function return values:
|
||||
======== ===================================================================
|
||||
0 Statically allocated state was successfully set up
|
||||
|
||||
>0 Dynamically allocated state was successfully set up.
|
||||
|
||||
The returned number is the state number which was allocated. If
|
||||
the state callbacks have to be removed later, e.g. module
|
||||
removal, then this number has to be saved by the caller and used
|
||||
as @state argument for the state remove function. For
|
||||
multi-instance states the dynamically allocated state number is
|
||||
also required as @state argument for the instance add/remove
|
||||
operations.
|
||||
|
||||
<0 Operation failed
|
||||
======== ===================================================================
|
||||
|
||||
Removal of a CPU hotplug state
|
||||
------------------------------
|
||||
|
||||
To remove a previously set up state, the following functions are provided:
|
||||
|
||||
* cpuhp_remove_state(state)
|
||||
* cpuhp_remove_state_nocalls(state)
|
||||
* cpuhp_remove_state_nocalls_cpuslocked(state)
|
||||
* cpuhp_remove_multi_state(state)
|
||||
|
||||
The @state argument is either a statically allocated state or the state
|
||||
number which was allocated in the dynamic range by cpuhp_setup_state*(). If
|
||||
the state is in the dynamic range, then the state number is freed and
|
||||
available for dynamic allocation again.
|
||||
|
||||
The functions differ in the way how the installed callbacks are treated:
|
||||
|
||||
* cpuhp_remove_state_nocalls(), cpuhp_remove_state_nocalls_cpuslocked()
|
||||
and cpuhp_remove_multi_state() only remove the callbacks.
|
||||
|
||||
* cpuhp_remove_state() removes the callbacks and invokes the teardown
|
||||
callback (if not NULL) for all online CPUs which have currently a state
|
||||
greater than the removed state. Depending on the state section the
|
||||
callback is either invoked on the current CPU (PREPARE section) or on
|
||||
each online CPU (ONLINE section) in the context of the CPU's hotplug
|
||||
thread.
|
||||
|
||||
In order to complete the removal, the teardown callback should not fail.
|
||||
|
||||
The state removal and the callback invocations are serialized against CPU
|
||||
hotplug operations. If the remove function has to be called from a CPU
|
||||
hotplug read locked region, then the _cpuslocked() variants have to be
|
||||
used. These functions cannot be used from within CPU hotplug callbacks.
|
||||
|
||||
If a multi-instance state is removed then the caller has to remove all
|
||||
instances first.
|
||||
|
||||
Multi-Instance state instance management
|
||||
----------------------------------------
|
||||
|
||||
Once the multi-instance state is set up, instances can be added to the
|
||||
state:
|
||||
|
||||
* cpuhp_state_add_instance(state, node)
|
||||
* cpuhp_state_add_instance_nocalls(state, node)
|
||||
|
||||
The @state argument is either a statically allocated state or the state
|
||||
number which was allocated in the dynamic range by cpuhp_setup_state_multi().
|
||||
|
||||
The @node argument is a pointer to an hlist_node which is embedded in the
|
||||
instance's data structure. The pointer is handed to the multi-instance
|
||||
state callbacks and can be used by the callback to retrieve the instance
|
||||
via container_of().
|
||||
|
||||
The functions differ in the way how the installed callbacks are treated:
|
||||
|
||||
* cpuhp_state_add_instance_nocalls() and only adds the instance to the
|
||||
multi-instance state's node list.
|
||||
|
||||
* cpuhp_state_add_instance() adds the instance and invokes the startup
|
||||
callback (if not NULL) associated with @state for all online CPUs which
|
||||
have currently a state greater than @state. The callback is only
|
||||
invoked for the to be added instance. Depending on the state section
|
||||
the callback is either invoked on the current CPU (PREPARE section) or
|
||||
on each online CPU (ONLINE section) in the context of the CPU's hotplug
|
||||
thread.
|
||||
|
||||
If a callback fails for CPU N then the teardown callback for CPU
|
||||
0 .. N-1 is invoked to rollback the operation, the function fails and
|
||||
the instance is not added to the node list of the multi-instance state.
|
||||
|
||||
To remove an instance from the state's node list these functions are
|
||||
available:
|
||||
|
||||
* cpuhp_state_remove_instance(state, node)
|
||||
* cpuhp_state_remove_instance_nocalls(state, node)
|
||||
|
||||
The arguments are the same as for the the cpuhp_state_add_instance*()
|
||||
variants above.
|
||||
|
||||
The functions differ in the way how the installed callbacks are treated:
|
||||
|
||||
* cpuhp_state_remove_instance_nocalls() only removes the instance from the
|
||||
state's node list.
|
||||
|
||||
* cpuhp_state_remove_instance() removes the instance and invokes the
|
||||
teardown callback (if not NULL) associated with @state for all online
|
||||
CPUs which have currently a state greater than @state. The callback is
|
||||
only invoked for the to be removed instance. Depending on the state
|
||||
section the callback is either invoked on the current CPU (PREPARE
|
||||
section) or on each online CPU (ONLINE section) in the context of the
|
||||
CPU's hotplug thread.
|
||||
|
||||
In order to complete the removal, the teardown callback should not fail.
|
||||
|
||||
The node list add/remove operations and the callback invocations are
|
||||
serialized against CPU hotplug operations. These functions cannot be used
|
||||
from within CPU hotplug callbacks and CPU hotplug read locked regions.
|
||||
|
||||
Examples
|
||||
--------
|
||||
|
||||
Setup and teardown a statically allocated state in the STARTING section for
|
||||
notifications on online and offline operations::
|
||||
|
||||
ret = cpuhp_setup_state(CPUHP_SUBSYS_STARTING, "subsys:starting", subsys_cpu_starting, subsys_cpu_dying);
|
||||
if (ret < 0)
|
||||
return ret;
|
||||
....
|
||||
cpuhp_remove_state(CPUHP_SUBSYS_STARTING);
|
||||
|
||||
Setup and teardown a dynamically allocated state in the ONLINE section
|
||||
for notifications on offline operations::
|
||||
|
||||
state = cpuhp_setup_state(CPUHP_ONLINE_DYN, "subsys:offline", NULL, subsys_cpu_offline);
|
||||
if (state < 0)
|
||||
return state;
|
||||
....
|
||||
cpuhp_remove_state(state);
|
||||
|
||||
Setup and teardown a dynamically allocated state in the ONLINE section
|
||||
for notifications on online operations without invoking the callbacks::
|
||||
|
||||
state = cpuhp_setup_state_nocalls(CPUHP_ONLINE_DYN, "subsys:online", subsys_cpu_online, NULL);
|
||||
if (state < 0)
|
||||
return state;
|
||||
....
|
||||
cpuhp_remove_state_nocalls(state);
|
||||
|
||||
Setup, use and teardown a dynamically allocated multi-instance state in the
|
||||
ONLINE section for notifications on online and offline operation::
|
||||
|
||||
state = cpuhp_setup_state_multi(CPUHP_ONLINE_DYN, "subsys:online", subsys_cpu_online, subsys_cpu_offline);
|
||||
if (state < 0)
|
||||
return state;
|
||||
....
|
||||
ret = cpuhp_state_add_instance(state, &inst1->node);
|
||||
if (ret)
|
||||
return ret;
|
||||
....
|
||||
ret = cpuhp_state_add_instance(state, &inst2->node);
|
||||
if (ret)
|
||||
return ret;
|
||||
....
|
||||
cpuhp_remove_instance(state, &inst1->node);
|
||||
....
|
||||
cpuhp_remove_instance(state, &inst2->node);
|
||||
....
|
||||
remove_multi_state(state);
|
||||
|
||||
A dynamically allocated state via *CPUHP_AP_ONLINE_DYN* is often enough.
|
||||
However if an earlier invocation during the bring up or shutdown is required
|
||||
then an explicit state should be acquired. An explicit state might also be
|
||||
required if the hotplug event requires specific ordering in respect to
|
||||
another hotplug event.
|
||||
|
||||
Testing of hotplug states
|
||||
=========================
|
||||
|
||||
One way to verify whether a custom state is working as expected or not is to
|
||||
shutdown a CPU and then put it online again. It is also possible to put the CPU
|
||||
to certain state (for instance *CPUHP_AP_ONLINE*) and then go back to
|
||||
*CPUHP_ONLINE*. This would simulate an error one state after *CPUHP_AP_ONLINE*
|
||||
which would lead to rollback to the online state.
|
||||
|
||||
All registered states are enumerated in ``/sys/devices/system/cpu/hotplug/states``: ::
|
||||
All registered states are enumerated in ``/sys/devices/system/cpu/hotplug/states`` ::
|
||||
|
||||
$ tail /sys/devices/system/cpu/hotplug/states
|
||||
138: mm/vmscan:online
|
||||
@ -268,7 +657,7 @@ All registered states are enumerated in ``/sys/devices/system/cpu/hotplug/states
|
||||
168: sched:active
|
||||
169: online
|
||||
|
||||
To rollback CPU4 to ``lib/percpu_cnt:online`` and back online just issue: ::
|
||||
To rollback CPU4 to ``lib/percpu_cnt:online`` and back online just issue::
|
||||
|
||||
$ cat /sys/devices/system/cpu/cpu4/hotplug/state
|
||||
169
|
||||
@ -276,14 +665,14 @@ To rollback CPU4 to ``lib/percpu_cnt:online`` and back online just issue: ::
|
||||
$ cat /sys/devices/system/cpu/cpu4/hotplug/state
|
||||
140
|
||||
|
||||
It is important to note that the teardown callbac of state 140 have been
|
||||
invoked. And now get back online: ::
|
||||
It is important to note that the teardown callback of state 140 have been
|
||||
invoked. And now get back online::
|
||||
|
||||
$ echo 169 > /sys/devices/system/cpu/cpu4/hotplug/target
|
||||
$ cat /sys/devices/system/cpu/cpu4/hotplug/state
|
||||
169
|
||||
|
||||
With trace events enabled, the individual steps are visible, too: ::
|
||||
With trace events enabled, the individual steps are visible, too::
|
||||
|
||||
# TASK-PID CPU# TIMESTAMP FUNCTION
|
||||
# | | | | |
|
||||
@ -318,6 +707,7 @@ trace.
|
||||
|
||||
Architecture's requirements
|
||||
===========================
|
||||
|
||||
The following functions and configurations are required:
|
||||
|
||||
``CONFIG_HOTPLUG_CPU``
|
||||
@ -339,11 +729,12 @@ The following functions and configurations are required:
|
||||
|
||||
User Space Notification
|
||||
=======================
|
||||
After CPU successfully onlined or offline udev events are sent. A udev rule like: ::
|
||||
|
||||
After CPU successfully onlined or offline udev events are sent. A udev rule like::
|
||||
|
||||
SUBSYSTEM=="cpu", DRIVERS=="processor", DEVPATH=="/devices/system/cpu/*", RUN+="the_hotplug_receiver.sh"
|
||||
|
||||
will receive all events. A script like: ::
|
||||
will receive all events. A script like::
|
||||
|
||||
#!/bin/sh
|
||||
|
||||
|
@ -55,8 +55,24 @@ exist then it will allocate a new Linux irq_desc, associate it with
|
||||
the hwirq, and call the .map() callback so the driver can perform any
|
||||
required hardware setup.
|
||||
|
||||
When an interrupt is received, irq_find_mapping() function should
|
||||
be used to find the Linux IRQ number from the hwirq number.
|
||||
Once a mapping has been established, it can be retrieved or used via a
|
||||
variety of methods:
|
||||
|
||||
- irq_resolve_mapping() returns a pointer to the irq_desc structure
|
||||
for a given domain and hwirq number, and NULL if there was no
|
||||
mapping.
|
||||
- irq_find_mapping() returns a Linux IRQ number for a given domain and
|
||||
hwirq number, and 0 if there was no mapping
|
||||
- irq_linear_revmap() is now identical to irq_find_mapping(), and is
|
||||
deprecated
|
||||
- generic_handle_domain_irq() handles an interrupt described by a
|
||||
domain and a hwirq number
|
||||
- handle_domain_irq() does the same thing for root interrupt
|
||||
controllers and deals with the set_irq_reg()/irq_enter() sequences
|
||||
that most architecture requires
|
||||
|
||||
Note that irq domain lookups must happen in contexts that are
|
||||
compatible with a RCU read-side critical section.
|
||||
|
||||
The irq_create_mapping() function must be called *atleast once*
|
||||
before any call to irq_find_mapping(), lest the descriptor will not
|
||||
@ -137,7 +153,9 @@ required. Calling irq_create_direct_mapping() will allocate a Linux
|
||||
IRQ number and call the .map() callback so that driver can program the
|
||||
Linux IRQ number into the hardware.
|
||||
|
||||
Most drivers cannot use this mapping.
|
||||
Most drivers cannot use this mapping, and it is now gated on the
|
||||
CONFIG_IRQ_DOMAIN_NOMAP option. Please refrain from introducing new
|
||||
users of this API.
|
||||
|
||||
Legacy
|
||||
------
|
||||
@ -157,6 +175,10 @@ for IRQ numbers that are passed to struct device registrations. In that
|
||||
case the Linux IRQ numbers cannot be dynamically assigned and the legacy
|
||||
mapping should be used.
|
||||
|
||||
As the name implies, the *_legacy() functions are deprecated and only
|
||||
exist to ease the support of ancient platforms. No new users should be
|
||||
added.
|
||||
|
||||
The legacy map assumes a contiguous range of IRQ numbers has already
|
||||
been allocated for the controller and that the IRQ number can be
|
||||
calculated by adding a fixed offset to the hwirq number, and
|
||||
|
@ -315,6 +315,9 @@ Block Devices
|
||||
.. kernel-doc:: block/genhd.c
|
||||
:export:
|
||||
|
||||
.. kernel-doc:: block/bdev.c
|
||||
:export:
|
||||
|
||||
Char devices
|
||||
============
|
||||
|
||||
|
@ -107,9 +107,6 @@ also ``CONFIG_DYNAMIC_DEBUG`` in the case of pr_debug()) is defined.
|
||||
Function reference
|
||||
==================
|
||||
|
||||
.. kernel-doc:: kernel/printk/printk.c
|
||||
:functions: printk
|
||||
|
||||
.. kernel-doc:: include/linux/printk.h
|
||||
:functions: pr_emerg pr_alert pr_crit pr_err pr_warn pr_notice pr_info
|
||||
:functions: printk pr_emerg pr_alert pr_crit pr_err pr_warn pr_notice pr_info
|
||||
pr_fmt pr_debug pr_devel pr_cont
|
||||
|
@ -130,6 +130,7 @@ printed after the symbol name with an extra ``b`` appended to the end of the
|
||||
specifier.
|
||||
|
||||
::
|
||||
|
||||
%pS versatile_init+0x0/0x110 [module_name]
|
||||
%pSb versatile_init+0x0/0x110 [module_name ed5019fdf5e53be37cb1ba7899292d7e143b259e]
|
||||
%pSRb versatile_init+0x9/0x110 [module_name ed5019fdf5e53be37cb1ba7899292d7e143b259e]
|
||||
|
@ -75,9 +75,6 @@ And optionally
|
||||
.resume - A pointer to a per-policy resume function which is called
|
||||
with interrupts disabled and _before_ the governor is started again.
|
||||
|
||||
.ready - A pointer to a per-policy ready function which is called after
|
||||
the policy is fully initialized.
|
||||
|
||||
.attr - A pointer to a NULL-terminated list of "struct freq_attr" which
|
||||
allow to export values to sysfs.
|
||||
|
||||
|
@ -181,9 +181,16 @@ By default, KASAN prints a bug report only for the first invalid memory access.
|
||||
With ``kasan_multi_shot``, KASAN prints a report on every invalid access. This
|
||||
effectively disables ``panic_on_warn`` for KASAN reports.
|
||||
|
||||
Alternatively, independent of ``panic_on_warn`` the ``kasan.fault=`` boot
|
||||
parameter can be used to control panic and reporting behaviour:
|
||||
|
||||
- ``kasan.fault=report`` or ``=panic`` controls whether to only print a KASAN
|
||||
report or also panic the kernel (default: ``report``). The panic happens even
|
||||
if ``kasan_multi_shot`` is enabled.
|
||||
|
||||
Hardware tag-based KASAN mode (see the section about various modes below) is
|
||||
intended for use in production as a security mitigation. Therefore, it supports
|
||||
boot parameters that allow disabling KASAN or controlling its features.
|
||||
additional boot parameters that allow disabling KASAN or controlling features:
|
||||
|
||||
- ``kasan=off`` or ``=on`` controls whether KASAN is enabled (default: ``on``).
|
||||
|
||||
@ -199,10 +206,6 @@ boot parameters that allow disabling KASAN or controlling its features.
|
||||
- ``kasan.stacktrace=off`` or ``=on`` disables or enables alloc and free stack
|
||||
traces collection (default: ``on``).
|
||||
|
||||
- ``kasan.fault=report`` or ``=panic`` controls whether to only print a KASAN
|
||||
report or also panic the kernel (default: ``report``). The panic happens even
|
||||
if ``kasan_multi_shot`` is enabled.
|
||||
|
||||
Implementation details
|
||||
----------------------
|
||||
|
||||
|
@ -127,6 +127,18 @@ Kconfig options:
|
||||
causes KCSAN to not report data races due to conflicts where the only plain
|
||||
accesses are aligned writes up to word size.
|
||||
|
||||
* ``CONFIG_KCSAN_PERMISSIVE``: Enable additional permissive rules to ignore
|
||||
certain classes of common data races. Unlike the above, the rules are more
|
||||
complex involving value-change patterns, access type, and address. This
|
||||
option depends on ``CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY=y``. For details
|
||||
please see the ``kernel/kcsan/permissive.h``. Testers and maintainers that
|
||||
only focus on reports from specific subsystems and not the whole kernel are
|
||||
recommended to disable this option.
|
||||
|
||||
To use the strictest possible rules, select ``CONFIG_KCSAN_STRICT=y``, which
|
||||
configures KCSAN to follow the Linux-kernel memory consistency model (LKMM) as
|
||||
closely as possible.
|
||||
|
||||
DebugFS interface
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
|
@ -65,25 +65,27 @@ Error reports
|
||||
A typical out-of-bounds access looks like this::
|
||||
|
||||
==================================================================
|
||||
BUG: KFENCE: out-of-bounds read in test_out_of_bounds_read+0xa3/0x22b
|
||||
BUG: KFENCE: out-of-bounds read in test_out_of_bounds_read+0xa6/0x234
|
||||
|
||||
Out-of-bounds read at 0xffffffffb672efff (1B left of kfence-#17):
|
||||
test_out_of_bounds_read+0xa3/0x22b
|
||||
kunit_try_run_case+0x51/0x85
|
||||
Out-of-bounds read at 0xffff8c3f2e291fff (1B left of kfence-#72):
|
||||
test_out_of_bounds_read+0xa6/0x234
|
||||
kunit_try_run_case+0x61/0xa0
|
||||
kunit_generic_run_threadfn_adapter+0x16/0x30
|
||||
kthread+0x137/0x160
|
||||
kthread+0x176/0x1b0
|
||||
ret_from_fork+0x22/0x30
|
||||
|
||||
kfence-#17 [0xffffffffb672f000-0xffffffffb672f01f, size=32, cache=kmalloc-32] allocated by task 507:
|
||||
test_alloc+0xf3/0x25b
|
||||
test_out_of_bounds_read+0x98/0x22b
|
||||
kunit_try_run_case+0x51/0x85
|
||||
kfence-#72: 0xffff8c3f2e292000-0xffff8c3f2e29201f, size=32, cache=kmalloc-32
|
||||
|
||||
allocated by task 484 on cpu 0 at 32.919330s:
|
||||
test_alloc+0xfe/0x738
|
||||
test_out_of_bounds_read+0x9b/0x234
|
||||
kunit_try_run_case+0x61/0xa0
|
||||
kunit_generic_run_threadfn_adapter+0x16/0x30
|
||||
kthread+0x137/0x160
|
||||
kthread+0x176/0x1b0
|
||||
ret_from_fork+0x22/0x30
|
||||
|
||||
CPU: 4 PID: 107 Comm: kunit_try_catch Not tainted 5.8.0-rc6+ #7
|
||||
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
|
||||
CPU: 0 PID: 484 Comm: kunit_try_catch Not tainted 5.13.0-rc3+ #7
|
||||
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
|
||||
==================================================================
|
||||
|
||||
The header of the report provides a short summary of the function involved in
|
||||
@ -96,30 +98,32 @@ Use-after-free accesses are reported as::
|
||||
==================================================================
|
||||
BUG: KFENCE: use-after-free read in test_use_after_free_read+0xb3/0x143
|
||||
|
||||
Use-after-free read at 0xffffffffb673dfe0 (in kfence-#24):
|
||||
Use-after-free read at 0xffff8c3f2e2a0000 (in kfence-#79):
|
||||
test_use_after_free_read+0xb3/0x143
|
||||
kunit_try_run_case+0x51/0x85
|
||||
kunit_try_run_case+0x61/0xa0
|
||||
kunit_generic_run_threadfn_adapter+0x16/0x30
|
||||
kthread+0x137/0x160
|
||||
kthread+0x176/0x1b0
|
||||
ret_from_fork+0x22/0x30
|
||||
|
||||
kfence-#24 [0xffffffffb673dfe0-0xffffffffb673dfff, size=32, cache=kmalloc-32] allocated by task 507:
|
||||
test_alloc+0xf3/0x25b
|
||||
kfence-#79: 0xffff8c3f2e2a0000-0xffff8c3f2e2a001f, size=32, cache=kmalloc-32
|
||||
|
||||
allocated by task 488 on cpu 2 at 33.871326s:
|
||||
test_alloc+0xfe/0x738
|
||||
test_use_after_free_read+0x76/0x143
|
||||
kunit_try_run_case+0x51/0x85
|
||||
kunit_try_run_case+0x61/0xa0
|
||||
kunit_generic_run_threadfn_adapter+0x16/0x30
|
||||
kthread+0x137/0x160
|
||||
kthread+0x176/0x1b0
|
||||
ret_from_fork+0x22/0x30
|
||||
|
||||
freed by task 507:
|
||||
freed by task 488 on cpu 2 at 33.871358s:
|
||||
test_use_after_free_read+0xa8/0x143
|
||||
kunit_try_run_case+0x51/0x85
|
||||
kunit_try_run_case+0x61/0xa0
|
||||
kunit_generic_run_threadfn_adapter+0x16/0x30
|
||||
kthread+0x137/0x160
|
||||
kthread+0x176/0x1b0
|
||||
ret_from_fork+0x22/0x30
|
||||
|
||||
CPU: 4 PID: 109 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
|
||||
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
|
||||
CPU: 2 PID: 488 Comm: kunit_try_catch Tainted: G B 5.13.0-rc3+ #7
|
||||
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
|
||||
==================================================================
|
||||
|
||||
KFENCE also reports on invalid frees, such as double-frees::
|
||||
@ -127,30 +131,32 @@ KFENCE also reports on invalid frees, such as double-frees::
|
||||
==================================================================
|
||||
BUG: KFENCE: invalid free in test_double_free+0xdc/0x171
|
||||
|
||||
Invalid free of 0xffffffffb6741000:
|
||||
Invalid free of 0xffff8c3f2e2a4000 (in kfence-#81):
|
||||
test_double_free+0xdc/0x171
|
||||
kunit_try_run_case+0x51/0x85
|
||||
kunit_try_run_case+0x61/0xa0
|
||||
kunit_generic_run_threadfn_adapter+0x16/0x30
|
||||
kthread+0x137/0x160
|
||||
kthread+0x176/0x1b0
|
||||
ret_from_fork+0x22/0x30
|
||||
|
||||
kfence-#26 [0xffffffffb6741000-0xffffffffb674101f, size=32, cache=kmalloc-32] allocated by task 507:
|
||||
test_alloc+0xf3/0x25b
|
||||
kfence-#81: 0xffff8c3f2e2a4000-0xffff8c3f2e2a401f, size=32, cache=kmalloc-32
|
||||
|
||||
allocated by task 490 on cpu 1 at 34.175321s:
|
||||
test_alloc+0xfe/0x738
|
||||
test_double_free+0x76/0x171
|
||||
kunit_try_run_case+0x51/0x85
|
||||
kunit_try_run_case+0x61/0xa0
|
||||
kunit_generic_run_threadfn_adapter+0x16/0x30
|
||||
kthread+0x137/0x160
|
||||
kthread+0x176/0x1b0
|
||||
ret_from_fork+0x22/0x30
|
||||
|
||||
freed by task 507:
|
||||
freed by task 490 on cpu 1 at 34.175348s:
|
||||
test_double_free+0xa8/0x171
|
||||
kunit_try_run_case+0x51/0x85
|
||||
kunit_try_run_case+0x61/0xa0
|
||||
kunit_generic_run_threadfn_adapter+0x16/0x30
|
||||
kthread+0x137/0x160
|
||||
kthread+0x176/0x1b0
|
||||
ret_from_fork+0x22/0x30
|
||||
|
||||
CPU: 4 PID: 111 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
|
||||
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
|
||||
CPU: 1 PID: 490 Comm: kunit_try_catch Tainted: G B 5.13.0-rc3+ #7
|
||||
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
|
||||
==================================================================
|
||||
|
||||
KFENCE also uses pattern-based redzones on the other side of an object's guard
|
||||
@ -160,23 +166,25 @@ These are reported on frees::
|
||||
==================================================================
|
||||
BUG: KFENCE: memory corruption in test_kmalloc_aligned_oob_write+0xef/0x184
|
||||
|
||||
Corrupted memory at 0xffffffffb6797ff9 [ 0xac . . . . . . ] (in kfence-#69):
|
||||
Corrupted memory at 0xffff8c3f2e33aff9 [ 0xac . . . . . . ] (in kfence-#156):
|
||||
test_kmalloc_aligned_oob_write+0xef/0x184
|
||||
kunit_try_run_case+0x51/0x85
|
||||
kunit_try_run_case+0x61/0xa0
|
||||
kunit_generic_run_threadfn_adapter+0x16/0x30
|
||||
kthread+0x137/0x160
|
||||
kthread+0x176/0x1b0
|
||||
ret_from_fork+0x22/0x30
|
||||
|
||||
kfence-#69 [0xffffffffb6797fb0-0xffffffffb6797ff8, size=73, cache=kmalloc-96] allocated by task 507:
|
||||
test_alloc+0xf3/0x25b
|
||||
kfence-#156: 0xffff8c3f2e33afb0-0xffff8c3f2e33aff8, size=73, cache=kmalloc-96
|
||||
|
||||
allocated by task 502 on cpu 7 at 42.159302s:
|
||||
test_alloc+0xfe/0x738
|
||||
test_kmalloc_aligned_oob_write+0x57/0x184
|
||||
kunit_try_run_case+0x51/0x85
|
||||
kunit_try_run_case+0x61/0xa0
|
||||
kunit_generic_run_threadfn_adapter+0x16/0x30
|
||||
kthread+0x137/0x160
|
||||
kthread+0x176/0x1b0
|
||||
ret_from_fork+0x22/0x30
|
||||
|
||||
CPU: 4 PID: 120 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
|
||||
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
|
||||
CPU: 7 PID: 502 Comm: kunit_try_catch Tainted: G B 5.13.0-rc3+ #7
|
||||
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
|
||||
==================================================================
|
||||
|
||||
For such errors, the address where the corruption occurred as well as the
|
||||
|
@ -114,9 +114,12 @@ results in TAP format, you can pass the ``--raw_output`` argument.
|
||||
|
||||
./tools/testing/kunit/kunit.py run --raw_output
|
||||
|
||||
.. note::
|
||||
The raw output from test runs may contain other, non-KUnit kernel log
|
||||
lines.
|
||||
The raw output from test runs may contain other, non-KUnit kernel log
|
||||
lines. You can see just KUnit output with ``--raw_output=kunit``:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
./tools/testing/kunit/kunit.py run --raw_output=kunit
|
||||
|
||||
If you have KUnit results in their raw TAP format, you can parse them and print
|
||||
the human-readable summary with the ``parse`` command for kunit_tool. This
|
||||
|
@ -80,25 +80,23 @@ file ``.kunitconfig``, you can just pass in the dir, e.g.
|
||||
automagically, but tests could theoretically depend on incompatible
|
||||
options, so handling that would be tricky.
|
||||
|
||||
Setting kernel commandline parameters
|
||||
-------------------------------------
|
||||
|
||||
You can use ``--kernel_args`` to pass arbitrary kernel arguments, e.g.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ ./tools/testing/kunit/kunit.py run --kernel_args=param=42 --kernel_args=param2=false
|
||||
|
||||
|
||||
Generating code coverage reports under UML
|
||||
------------------------------------------
|
||||
|
||||
.. note::
|
||||
TODO(brendanhiggins@google.com): There are various issues with UML and
|
||||
versions of gcc 7 and up. You're likely to run into missing ``.gcda``
|
||||
files or compile errors. We know one `faulty GCC commit
|
||||
<https://github.com/gcc-mirror/gcc/commit/8c9434c2f9358b8b8bad2c1990edf10a21645f9d>`_
|
||||
but not how we'd go about getting this fixed. The compile errors still
|
||||
need some investigation.
|
||||
|
||||
.. note::
|
||||
TODO(brendanhiggins@google.com): for recent versions of Linux
|
||||
(5.10-5.12, maybe earlier), there's a bug with gcov counters not being
|
||||
flushed in UML. This translates to very low (<1%) reported coverage. This is
|
||||
related to the above issue and can be worked around by replacing the
|
||||
one call to ``uml_abort()`` (it's in ``os_dump_core()``) with a plain
|
||||
``exit()``.
|
||||
|
||||
files or compile errors.
|
||||
|
||||
This is different from the "normal" way of getting coverage information that is
|
||||
documented in Documentation/dev-tools/gcov.rst.
|
||||
|
@ -28,7 +28,7 @@ find_cmd = find $(srctree)/$(src) \( -name '*.yaml' ! \
|
||||
|
||||
quiet_cmd_yamllint = LINT $(src)
|
||||
cmd_yamllint = ($(find_cmd) | \
|
||||
xargs $(DT_SCHEMA_LINT) -f parsable -c $(srctree)/$(src)/.yamllint) || true
|
||||
xargs $(DT_SCHEMA_LINT) -f parsable -c $(srctree)/$(src)/.yamllint >&2) || true
|
||||
|
||||
quiet_cmd_chk_bindings = CHKDT $@
|
||||
cmd_chk_bindings = ($(find_cmd) | \
|
||||
|
@ -145,6 +145,11 @@ properties:
|
||||
- const: atmel,sama5d4
|
||||
- const: atmel,sama5
|
||||
|
||||
- items:
|
||||
- const: microchip,sama7g5ek # SAMA7G5 Evaluation Kit
|
||||
- const: microchip,sama7g5
|
||||
- const: microchip,sama7
|
||||
|
||||
- items:
|
||||
- enum:
|
||||
- atmel,sams70j19
|
||||
|
@ -45,7 +45,8 @@ RAMC SDRAM/DDR Controller required properties:
|
||||
"atmel,at91sam9260-sdramc",
|
||||
"atmel,at91sam9g45-ddramc",
|
||||
"atmel,sama5d3-ddramc",
|
||||
"microchip,sam9x60-ddramc"
|
||||
"microchip,sam9x60-ddramc",
|
||||
"microchip,sama7g5-uddrc"
|
||||
- reg: Should contain registers location and length
|
||||
|
||||
Examples:
|
||||
@ -55,6 +56,17 @@ Examples:
|
||||
reg = <0xffffe800 0x200>;
|
||||
};
|
||||
|
||||
RAMC PHY Controller required properties:
|
||||
- compatible: Should be "microchip,sama7g5-ddr3phy", "syscon"
|
||||
- reg: Should contain registers location and length
|
||||
|
||||
Example:
|
||||
|
||||
ddr3phy: ddr3phy@e3804000 {
|
||||
compatible = "microchip,sama7g5-ddr3phy", "syscon";
|
||||
reg = <0xe3804000 0x1000>;
|
||||
};
|
||||
|
||||
SHDWC Shutdown Controller
|
||||
|
||||
required properties:
|
||||
|
@ -221,9 +221,13 @@ properties:
|
||||
- prt,prti6q # Protonic PRTI6Q board
|
||||
- prt,prtwd2 # Protonic WD2 board
|
||||
- rex,imx6q-rex-pro # Rex Pro i.MX6 Quad Board
|
||||
- skov,imx6q-skov-revc-lt2 # SKOV IMX6 CPU QuadCore lt2
|
||||
- skov,imx6q-skov-revc-lt6 # SKOV IMX6 CPU QuadCore lt6
|
||||
- skov,imx6q-skov-reve-mi1010ait-1cp1 # SKOV IMX6 CPU QuadCore mi1010ait-1cp1
|
||||
- solidrun,cubox-i/q # SolidRun Cubox-i Dual/Quad
|
||||
- solidrun,hummingboard/q
|
||||
- solidrun,hummingboard2/q
|
||||
- solidrun,solidsense/q # SolidRun SolidSense Dual/Quad
|
||||
- tbs,imx6q-tbs2910 # TBS2910 Matrix ARM mini PC
|
||||
- technexion,imx6q-pico-dwarf # TechNexion i.MX6Q Pico-Dwarf
|
||||
- technexion,imx6q-pico-hobbit # TechNexion i.MX6Q Pico-Hobbit
|
||||
@ -377,9 +381,12 @@ properties:
|
||||
- prt,prtvt7 # Protonic VT7 board
|
||||
- rex,imx6dl-rex-basic # Rex Basic i.MX6 Dual Lite Board
|
||||
- riot,imx6s-riotboard # RIoTboard i.MX6S
|
||||
- skov,imx6dl-skov-revc-lt2 # SKOV IMX6 CPU SoloCore lt2
|
||||
- skov,imx6dl-skov-revc-lt6 # SKOV IMX6 CPU SoloCore lt6
|
||||
- solidrun,cubox-i/dl # SolidRun Cubox-i Solo/DualLite
|
||||
- solidrun,hummingboard/dl
|
||||
- solidrun,hummingboard2/dl # SolidRun HummingBoard2 Solo/DualLite
|
||||
- solidrun,solidsense/dl # SolidRun SolidSense Solo/DualLite
|
||||
- technexion,imx6dl-pico-dwarf # TechNexion i.MX6DL Pico-Dwarf
|
||||
- technexion,imx6dl-pico-hobbit # TechNexion i.MX6DL Pico-Hobbit
|
||||
- technexion,imx6dl-pico-nymph # TechNexion i.MX6DL Pico-Nymph
|
||||
@ -418,6 +425,12 @@ properties:
|
||||
- const: dfi,fs700e-m60
|
||||
- const: fsl,imx6dl
|
||||
|
||||
- description: i.MX6DL DHCOM PicoITX Board
|
||||
items:
|
||||
- const: dh,imx6dl-dhcom-picoitx
|
||||
- const: dh,imx6dl-dhcom-som
|
||||
- const: fsl,imx6dl
|
||||
|
||||
- description: i.MX6DL Gateworks Ventana Boards
|
||||
items:
|
||||
- enum:
|
||||
@ -469,6 +482,12 @@ properties:
|
||||
- const: toradex,colibri_imx6dl # Colibri iMX6 Module
|
||||
- const: fsl,imx6dl
|
||||
|
||||
- description: i.MX6S DHCOM DRC02 Board
|
||||
items:
|
||||
- const: dh,imx6s-dhcom-drc02
|
||||
- const: dh,imx6s-dhcom-som
|
||||
- const: fsl,imx6dl
|
||||
|
||||
- description: i.MX6SL based Boards
|
||||
items:
|
||||
- enum:
|
||||
@ -698,6 +717,7 @@ properties:
|
||||
- gw,imx8mm-gw72xx-0x # i.MX8MM Gateworks Development Kit
|
||||
- gw,imx8mm-gw73xx-0x # i.MX8MM Gateworks Development Kit
|
||||
- gw,imx8mm-gw7901 # i.MX8MM Gateworks Board
|
||||
- gw,imx8mm-gw7902 # i.MX8MM Gateworks Board
|
||||
- kontron,imx8mm-n801x-som # i.MX8MM Kontron SL (N801X) SOM
|
||||
- variscite,var-som-mx8mm # i.MX8MM Variscite VAR-SOM-MX8MM module
|
||||
- const: fsl,imx8mm
|
||||
@ -728,6 +748,7 @@ properties:
|
||||
- beacon,imx8mn-beacon-kit # i.MX8MN Beacon Development Kit
|
||||
- fsl,imx8mn-ddr4-evk # i.MX8MN DDR4 EVK Board
|
||||
- fsl,imx8mn-evk # i.MX8MN LPDDR4 EVK Board
|
||||
- gw,imx8mn-gw7902 # i.MX8MM Gateworks Board
|
||||
- const: fsl,imx8mn
|
||||
|
||||
- description: Variscite VAR-SOM-MX8MN based boards
|
||||
@ -752,10 +773,12 @@ properties:
|
||||
items:
|
||||
- enum:
|
||||
- boundary,imx8mq-nitrogen8m # i.MX8MQ NITROGEN Board
|
||||
- boundary,imx8mq-nitrogen8m-som # i.MX8MQ NITROGEN SoM
|
||||
- einfochips,imx8mq-thor96 # i.MX8MQ Thor96 Board
|
||||
- fsl,imx8mq-evk # i.MX8MQ EVK Board
|
||||
- google,imx8mq-phanbell # Google Coral Edge TPU
|
||||
- kontron,pitx-imx8m # Kontron pITX-imx8m Board
|
||||
- mntre,reform2 # MNT Reform2 Laptop
|
||||
- purism,librem5-devkit # Purism Librem5 devkit
|
||||
- solidrun,hummingboard-pulse # SolidRun Hummingboard Pulse
|
||||
- technexion,pico-pi-imx8m # TechNexion PICO-PI-8M evk
|
||||
@ -973,6 +996,12 @@ properties:
|
||||
- fsl,s32v234-evb # S32V234-EVB2 Customer Evaluation Board
|
||||
- const: fsl,s32v234
|
||||
|
||||
- description: Traverse LS1088A based Boards
|
||||
items:
|
||||
- enum:
|
||||
- traverse,ten64 # Ten64 Networking Appliance / Board
|
||||
- const: fsl,ls1088a
|
||||
|
||||
additionalProperties: true
|
||||
|
||||
...
|
||||
|
@ -1,108 +0,0 @@
|
||||
Cortina systems Gemini platforms
|
||||
|
||||
The Gemini SoC is the project name for an ARMv4 FA525-based SoC originally
|
||||
produced by Storlink Semiconductor around 2005. The company was renamed
|
||||
later renamed Storm Semiconductor. The chip product name is Storlink SL3516.
|
||||
It was derived from earlier products from Storm named SL3316 (Centroid) and
|
||||
SL3512 (Bulverde).
|
||||
|
||||
Storm Semiconductor was acquired by Cortina Systems in 2008 and the SoC was
|
||||
produced and used for NAS and similar usecases. In 2014 Cortina Systems was
|
||||
in turn acquired by Inphi, who seem to have discontinued this product family.
|
||||
|
||||
Many of the IP blocks used in the SoC comes from Faraday Technology.
|
||||
|
||||
Required properties (in root node):
|
||||
compatible = "cortina,gemini";
|
||||
|
||||
Required nodes:
|
||||
|
||||
- soc: the SoC should be represented by a simple bus encompassing all the
|
||||
onchip devices, this is referred to as the soc bus node.
|
||||
|
||||
- syscon: the soc bus node must have a system controller node pointing to the
|
||||
global control registers, with the compatible string
|
||||
"cortina,gemini-syscon", "syscon";
|
||||
|
||||
Required properties on the syscon:
|
||||
- reg: syscon register location and size.
|
||||
- #clock-cells: should be set to <1> - the system controller is also a
|
||||
clock provider.
|
||||
- #reset-cells: should be set to <1> - the system controller is also a
|
||||
reset line provider.
|
||||
|
||||
The clock sources have shorthand defines in the include file:
|
||||
<dt-bindings/clock/cortina,gemini-clock.h>
|
||||
|
||||
The reset lines have shorthand defines in the include file:
|
||||
<dt-bindings/reset/cortina,gemini-reset.h>
|
||||
|
||||
- timer: the soc bus node must have a timer node pointing to the SoC timer
|
||||
block, with the compatible string "cortina,gemini-timer"
|
||||
See: clocksource/cortina,gemini-timer.txt
|
||||
|
||||
- interrupt-controller: the sob bus node must have an interrupt controller
|
||||
node pointing to the SoC interrupt controller block, with the compatible
|
||||
string "cortina,gemini-interrupt-controller"
|
||||
See interrupt-controller/cortina,gemini-interrupt-controller.txt
|
||||
|
||||
Example:
|
||||
|
||||
/ {
|
||||
model = "Foo Gemini Machine";
|
||||
compatible = "cortina,gemini";
|
||||
#address-cells = <1>;
|
||||
#size-cells = <1>;
|
||||
|
||||
memory {
|
||||
device_type = "memory";
|
||||
reg = <0x00000000 0x8000000>;
|
||||
};
|
||||
|
||||
soc {
|
||||
#address-cells = <1>;
|
||||
#size-cells = <1>;
|
||||
ranges;
|
||||
compatible = "simple-bus";
|
||||
interrupt-parent = <&intcon>;
|
||||
|
||||
syscon: syscon@40000000 {
|
||||
compatible = "cortina,gemini-syscon", "syscon";
|
||||
reg = <0x40000000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
#reset-cells = <1>;
|
||||
};
|
||||
|
||||
uart0: serial@42000000 {
|
||||
compatible = "ns16550a";
|
||||
reg = <0x42000000 0x100>;
|
||||
resets = <&syscon GEMINI_RESET_UART>;
|
||||
clocks = <&syscon GEMINI_CLK_UART>;
|
||||
interrupts = <18 IRQ_TYPE_LEVEL_HIGH>;
|
||||
reg-shift = <2>;
|
||||
};
|
||||
|
||||
timer@43000000 {
|
||||
compatible = "cortina,gemini-timer";
|
||||
reg = <0x43000000 0x1000>;
|
||||
interrupt-parent = <&intcon>;
|
||||
interrupts = <14 IRQ_TYPE_EDGE_FALLING>, /* Timer 1 */
|
||||
<15 IRQ_TYPE_EDGE_FALLING>, /* Timer 2 */
|
||||
<16 IRQ_TYPE_EDGE_FALLING>; /* Timer 3 */
|
||||
resets = <&syscon GEMINI_RESET_TIMER>;
|
||||
/* APB clock or RTC clock */
|
||||
clocks = <&syscon GEMINI_CLK_APB>,
|
||||
<&syscon GEMINI_CLK_RTC>;
|
||||
clock-names = "PCLK", "EXTCLK";
|
||||
syscon = <&syscon>;
|
||||
};
|
||||
|
||||
intcon: interrupt-controller@48000000 {
|
||||
compatible = "cortina,gemini-interrupt-controller";
|
||||
reg = <0x48000000 0x1000>;
|
||||
resets = <&syscon GEMINI_RESET_INTCON0>;
|
||||
interrupt-controller;
|
||||
#interrupt-cells = <2>;
|
||||
};
|
||||
};
|
||||
};
|
95
Documentation/devicetree/bindings/arm/gemini.yaml
Normal file
95
Documentation/devicetree/bindings/arm/gemini.yaml
Normal file
@ -0,0 +1,95 @@
|
||||
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
|
||||
%YAML 1.2
|
||||
---
|
||||
$id: http://devicetree.org/schemas/arm/gemini.yaml#
|
||||
$schema: http://devicetree.org/meta-schemas/core.yaml#
|
||||
|
||||
title: Cortina systems Gemini platforms
|
||||
|
||||
description: |
|
||||
The Gemini SoC is the project name for an ARMv4 FA525-based SoC originally
|
||||
produced by Storlink Semiconductor around 2005. The company was renamed
|
||||
later renamed Storm Semiconductor. The chip product name is Storlink SL3516.
|
||||
It was derived from earlier products from Storm named SL3316 (Centroid) and
|
||||
SL3512 (Bulverde).
|
||||
|
||||
Storm Semiconductor was acquired by Cortina Systems in 2008 and the SoC was
|
||||
produced and used for NAS and similar usecases. In 2014 Cortina Systems was
|
||||
in turn acquired by Inphi, who seem to have discontinued this product family.
|
||||
|
||||
Many of the IP blocks used in the SoC comes from Faraday Technology.
|
||||
|
||||
maintainers:
|
||||
- Linus Walleij <linus.walleij@linaro.org>
|
||||
|
||||
properties:
|
||||
$nodename:
|
||||
const: '/'
|
||||
compatible:
|
||||
oneOf:
|
||||
|
||||
- description: Storlink Semiconductor Gemini324 EV-Board also known
|
||||
as Storm Semiconductor SL93512R_BRD
|
||||
items:
|
||||
- const: storlink,gemini324
|
||||
- const: storm,sl93512r
|
||||
- const: cortina,gemini
|
||||
|
||||
- description: D-Link DIR-685 Xtreme N Storage Router
|
||||
items:
|
||||
- const: dlink,dir-685
|
||||
- const: cortina,gemini
|
||||
|
||||
- description: D-Link DNS-313 1-Bay Network Storage Enclosure
|
||||
items:
|
||||
- const: dlink,dns-313
|
||||
- const: cortina,gemini
|
||||
|
||||
- description: Edimax NS-2502
|
||||
items:
|
||||
- const: edimax,ns-2502
|
||||
- const: cortina,gemini
|
||||
|
||||
- description: ITian Square One SQ201
|
||||
items:
|
||||
- const: itian,sq201
|
||||
- const: cortina,gemini
|
||||
|
||||
- description: Raidsonic NAS IB-4220-B
|
||||
items:
|
||||
- const: raidsonic,ib-4220-b
|
||||
- const: cortina,gemini
|
||||
|
||||
- description: SSI 1328
|
||||
items:
|
||||
- const: ssi,1328
|
||||
- const: cortina,gemini
|
||||
|
||||
- description: Teltonika RUT1xx Mobile Router
|
||||
items:
|
||||
- const: teltonika,rut1xx
|
||||
- const: cortina,gemini
|
||||
|
||||
- description: Wiligear Wiliboard WBD-111
|
||||
items:
|
||||
- const: wiligear,wiliboard-wbd111
|
||||
- const: cortina,gemini
|
||||
|
||||
- description: Wiligear Wiliboard WBD-222
|
||||
items:
|
||||
- const: wiligear,wiliboard-wbd222
|
||||
- const: cortina,gemini
|
||||
|
||||
- description: Wiligear Wiliboard WBD-111 - old incorrect binding
|
||||
items:
|
||||
- const: wiliboard,wbd111
|
||||
- const: cortina,gemini
|
||||
deprecated: true
|
||||
|
||||
- description: Wiligear Wiliboard WBD-222 - old incorrect binding
|
||||
items:
|
||||
- const: wiliboard,wbd222
|
||||
- const: cortina,gemini
|
||||
deprecated: true
|
||||
|
||||
additionalProperties: true
|
@ -13,6 +13,7 @@ Required Properties:
|
||||
- "mediatek,mt7623-audsys", "mediatek,mt2701-audsys", "syscon"
|
||||
- "mediatek,mt8167-audiosys", "syscon"
|
||||
- "mediatek,mt8183-audiosys", "syscon"
|
||||
- "mediatek,mt8192-audsys", "syscon"
|
||||
- "mediatek,mt8516-audsys", "syscon"
|
||||
- #clock-cells: Must be 1
|
||||
|
||||
|
@ -1,31 +0,0 @@
|
||||
Mediatek mmsys controller
|
||||
============================
|
||||
|
||||
The Mediatek mmsys system controller provides clock control, routing control,
|
||||
and miscellaneous control in mmsys partition.
|
||||
|
||||
Required Properties:
|
||||
|
||||
- compatible: Should be one of:
|
||||
- "mediatek,mt2701-mmsys", "syscon"
|
||||
- "mediatek,mt2712-mmsys", "syscon"
|
||||
- "mediatek,mt6765-mmsys", "syscon"
|
||||
- "mediatek,mt6779-mmsys", "syscon"
|
||||
- "mediatek,mt6797-mmsys", "syscon"
|
||||
- "mediatek,mt7623-mmsys", "mediatek,mt2701-mmsys", "syscon"
|
||||
- "mediatek,mt8167-mmsys", "syscon"
|
||||
- "mediatek,mt8173-mmsys", "syscon"
|
||||
- "mediatek,mt8183-mmsys", "syscon"
|
||||
- #clock-cells: Must be 1
|
||||
|
||||
For the clock control, the mmsys controller uses the common clk binding from
|
||||
Documentation/devicetree/bindings/clock/clock-bindings.txt
|
||||
The available clocks are defined in dt-bindings/clock/mt*-clk.h.
|
||||
|
||||
Example:
|
||||
|
||||
mmsys: syscon@14000000 {
|
||||
compatible = "mediatek,mt8173-mmsys", "syscon";
|
||||
reg = <0 0x14000000 0 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
@ -0,0 +1,59 @@
|
||||
# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
|
||||
%YAML 1.2
|
||||
---
|
||||
$id: "http://devicetree.org/schemas/arm/mediatek/mediatek,mmsys.yaml#"
|
||||
$schema: "http://devicetree.org/meta-schemas/core.yaml#"
|
||||
|
||||
title: MediaTek mmsys controller
|
||||
|
||||
maintainers:
|
||||
- Matthias Brugger <matthias.bgg@gmail.com>
|
||||
|
||||
description:
|
||||
The MediaTek mmsys system controller provides clock control, routing control,
|
||||
and miscellaneous control in mmsys partition.
|
||||
|
||||
properties:
|
||||
$nodename:
|
||||
pattern: "^syscon@[0-9a-f]+$"
|
||||
|
||||
compatible:
|
||||
oneOf:
|
||||
- items:
|
||||
- enum:
|
||||
- mediatek,mt2701-mmsys
|
||||
- mediatek,mt2712-mmsys
|
||||
- mediatek,mt6765-mmsys
|
||||
- mediatek,mt6779-mmsys
|
||||
- mediatek,mt6797-mmsys
|
||||
- mediatek,mt8167-mmsys
|
||||
- mediatek,mt8173-mmsys
|
||||
- mediatek,mt8183-mmsys
|
||||
- mediatek,mt8192-mmsys
|
||||
- mediatek,mt8365-mmsys
|
||||
- const: syscon
|
||||
- items:
|
||||
- const: mediatek,mt7623-mmsys
|
||||
- const: mediatek,mt2701-mmsys
|
||||
- const: syscon
|
||||
|
||||
reg:
|
||||
maxItems: 1
|
||||
|
||||
"#clock-cells":
|
||||
const: 1
|
||||
|
||||
required:
|
||||
- compatible
|
||||
- reg
|
||||
- "#clock-cells"
|
||||
|
||||
additionalProperties: false
|
||||
|
||||
examples:
|
||||
- |
|
||||
mmsys: syscon@14000000 {
|
||||
compatible = "mediatek,mt8173-mmsys", "syscon";
|
||||
reg = <0x14000000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
@ -0,0 +1,199 @@
|
||||
# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
|
||||
%YAML 1.2
|
||||
---
|
||||
$id: "http://devicetree.org/schemas/arm/mediatek/mediatek,mt8192-clock.yaml#"
|
||||
$schema: "http://devicetree.org/meta-schemas/core.yaml#"
|
||||
|
||||
title: MediaTek Functional Clock Controller for MT8192
|
||||
|
||||
maintainers:
|
||||
- Chun-Jie Chen <chun-jie.chen@mediatek.com>
|
||||
|
||||
description:
|
||||
The Mediatek functional clock controller provides various clocks on MT8192.
|
||||
|
||||
properties:
|
||||
compatible:
|
||||
items:
|
||||
- enum:
|
||||
- mediatek,mt8192-scp_adsp
|
||||
- mediatek,mt8192-imp_iic_wrap_c
|
||||
- mediatek,mt8192-imp_iic_wrap_e
|
||||
- mediatek,mt8192-imp_iic_wrap_s
|
||||
- mediatek,mt8192-imp_iic_wrap_ws
|
||||
- mediatek,mt8192-imp_iic_wrap_w
|
||||
- mediatek,mt8192-imp_iic_wrap_n
|
||||
- mediatek,mt8192-msdc_top
|
||||
- mediatek,mt8192-msdc
|
||||
- mediatek,mt8192-mfgcfg
|
||||
- mediatek,mt8192-imgsys
|
||||
- mediatek,mt8192-imgsys2
|
||||
- mediatek,mt8192-vdecsys_soc
|
||||
- mediatek,mt8192-vdecsys
|
||||
- mediatek,mt8192-vencsys
|
||||
- mediatek,mt8192-camsys
|
||||
- mediatek,mt8192-camsys_rawa
|
||||
- mediatek,mt8192-camsys_rawb
|
||||
- mediatek,mt8192-camsys_rawc
|
||||
- mediatek,mt8192-ipesys
|
||||
- mediatek,mt8192-mdpsys
|
||||
|
||||
reg:
|
||||
maxItems: 1
|
||||
|
||||
'#clock-cells':
|
||||
const: 1
|
||||
|
||||
required:
|
||||
- compatible
|
||||
- reg
|
||||
|
||||
additionalProperties: false
|
||||
|
||||
examples:
|
||||
- |
|
||||
scp_adsp: clock-controller@10720000 {
|
||||
compatible = "mediatek,mt8192-scp_adsp";
|
||||
reg = <0x10720000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
- |
|
||||
imp_iic_wrap_c: clock-controller@11007000 {
|
||||
compatible = "mediatek,mt8192-imp_iic_wrap_c";
|
||||
reg = <0x11007000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
- |
|
||||
imp_iic_wrap_e: clock-controller@11cb1000 {
|
||||
compatible = "mediatek,mt8192-imp_iic_wrap_e";
|
||||
reg = <0x11cb1000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
- |
|
||||
imp_iic_wrap_s: clock-controller@11d03000 {
|
||||
compatible = "mediatek,mt8192-imp_iic_wrap_s";
|
||||
reg = <0x11d03000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
- |
|
||||
imp_iic_wrap_ws: clock-controller@11d23000 {
|
||||
compatible = "mediatek,mt8192-imp_iic_wrap_ws";
|
||||
reg = <0x11d23000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
- |
|
||||
imp_iic_wrap_w: clock-controller@11e01000 {
|
||||
compatible = "mediatek,mt8192-imp_iic_wrap_w";
|
||||
reg = <0x11e01000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
- |
|
||||
imp_iic_wrap_n: clock-controller@11f02000 {
|
||||
compatible = "mediatek,mt8192-imp_iic_wrap_n";
|
||||
reg = <0x11f02000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
- |
|
||||
msdc_top: clock-controller@11f10000 {
|
||||
compatible = "mediatek,mt8192-msdc_top";
|
||||
reg = <0x11f10000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
- |
|
||||
msdc: clock-controller@11f60000 {
|
||||
compatible = "mediatek,mt8192-msdc";
|
||||
reg = <0x11f60000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
- |
|
||||
mfgcfg: clock-controller@13fbf000 {
|
||||
compatible = "mediatek,mt8192-mfgcfg";
|
||||
reg = <0x13fbf000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
- |
|
||||
imgsys: clock-controller@15020000 {
|
||||
compatible = "mediatek,mt8192-imgsys";
|
||||
reg = <0x15020000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
- |
|
||||
imgsys2: clock-controller@15820000 {
|
||||
compatible = "mediatek,mt8192-imgsys2";
|
||||
reg = <0x15820000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
- |
|
||||
vdecsys_soc: clock-controller@1600f000 {
|
||||
compatible = "mediatek,mt8192-vdecsys_soc";
|
||||
reg = <0x1600f000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
- |
|
||||
vdecsys: clock-controller@1602f000 {
|
||||
compatible = "mediatek,mt8192-vdecsys";
|
||||
reg = <0x1602f000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
- |
|
||||
vencsys: clock-controller@17000000 {
|
||||
compatible = "mediatek,mt8192-vencsys";
|
||||
reg = <0x17000000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
- |
|
||||
camsys: clock-controller@1a000000 {
|
||||
compatible = "mediatek,mt8192-camsys";
|
||||
reg = <0x1a000000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
- |
|
||||
camsys_rawa: clock-controller@1a04f000 {
|
||||
compatible = "mediatek,mt8192-camsys_rawa";
|
||||
reg = <0x1a04f000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
- |
|
||||
camsys_rawb: clock-controller@1a06f000 {
|
||||
compatible = "mediatek,mt8192-camsys_rawb";
|
||||
reg = <0x1a06f000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
- |
|
||||
camsys_rawc: clock-controller@1a08f000 {
|
||||
compatible = "mediatek,mt8192-camsys_rawc";
|
||||
reg = <0x1a08f000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
- |
|
||||
ipesys: clock-controller@1b000000 {
|
||||
compatible = "mediatek,mt8192-ipesys";
|
||||
reg = <0x1b000000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
- |
|
||||
mdpsys: clock-controller@1f000000 {
|
||||
compatible = "mediatek,mt8192-mdpsys";
|
||||
reg = <0x1f000000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
@ -0,0 +1,65 @@
|
||||
# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
|
||||
%YAML 1.2
|
||||
---
|
||||
$id: "http://devicetree.org/schemas/arm/mediatek/mediatek,mt8192-sys-clock.yaml#"
|
||||
$schema: "http://devicetree.org/meta-schemas/core.yaml#"
|
||||
|
||||
title: MediaTek System Clock Controller for MT8192
|
||||
|
||||
maintainers:
|
||||
- Chun-Jie Chen <chun-jie.chen@mediatek.com>
|
||||
|
||||
description:
|
||||
The Mediatek system clock controller provides various clocks and system configuration
|
||||
like reset and bus protection on MT8192.
|
||||
|
||||
properties:
|
||||
compatible:
|
||||
items:
|
||||
- enum:
|
||||
- mediatek,mt8192-topckgen
|
||||
- mediatek,mt8192-infracfg
|
||||
- mediatek,mt8192-pericfg
|
||||
- mediatek,mt8192-apmixedsys
|
||||
- const: syscon
|
||||
|
||||
reg:
|
||||
maxItems: 1
|
||||
|
||||
'#clock-cells':
|
||||
const: 1
|
||||
|
||||
required:
|
||||
- compatible
|
||||
- reg
|
||||
|
||||
additionalProperties: false
|
||||
|
||||
examples:
|
||||
- |
|
||||
topckgen: syscon@10000000 {
|
||||
compatible = "mediatek,mt8192-topckgen", "syscon";
|
||||
reg = <0x10000000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
- |
|
||||
infracfg: syscon@10001000 {
|
||||
compatible = "mediatek,mt8192-infracfg", "syscon";
|
||||
reg = <0x10001000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
- |
|
||||
pericfg: syscon@10003000 {
|
||||
compatible = "mediatek,mt8192-pericfg", "syscon";
|
||||
reg = <0x10003000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
||||
|
||||
- |
|
||||
apmixedsys: syscon@1000c000 {
|
||||
compatible = "mediatek,mt8192-apmixedsys", "syscon";
|
||||
reg = <0x1000c000 0x1000>;
|
||||
#clock-cells = <1>;
|
||||
};
|
@ -31,6 +31,7 @@ description: |
|
||||
ipq6018
|
||||
ipq8074
|
||||
mdm9615
|
||||
msm8226
|
||||
msm8916
|
||||
msm8974
|
||||
msm8992
|
||||
@ -114,6 +115,11 @@ properties:
|
||||
- qcom,apq8084-sbc
|
||||
- const: qcom,apq8084
|
||||
|
||||
- items:
|
||||
- enum:
|
||||
- samsung,s3ve3g
|
||||
- const: qcom,msm8226
|
||||
|
||||
- items:
|
||||
- enum:
|
||||
- qcom,msm8960-cdp
|
||||
@ -129,6 +135,8 @@ properties:
|
||||
- const: qcom,msm8974
|
||||
|
||||
- items:
|
||||
- enum:
|
||||
- alcatel,idol347
|
||||
- const: qcom,msm8916-mtp/1
|
||||
- const: qcom,msm8916-mtp
|
||||
- const: qcom,msm8916
|
||||
@ -181,6 +189,8 @@ properties:
|
||||
- items:
|
||||
- enum:
|
||||
- qcom,sc7280-idp
|
||||
- qcom,sc7280-idp2
|
||||
- google,piglin
|
||||
- google,senor
|
||||
- const: qcom,sc7280
|
||||
|
||||
|
@ -238,17 +238,29 @@ properties:
|
||||
- const: renesas,r8a77961
|
||||
|
||||
- description: Kingfisher (SBEV-RCAR-KF-M03)
|
||||
items:
|
||||
- const: shimafuji,kingfisher
|
||||
- enum:
|
||||
- renesas,h3ulcb
|
||||
- renesas,m3ulcb
|
||||
- renesas,m3nulcb
|
||||
- enum:
|
||||
- renesas,r8a7795
|
||||
- renesas,r8a7796
|
||||
- renesas,r8a77961
|
||||
- renesas,r8a77965
|
||||
oneOf:
|
||||
- items:
|
||||
- const: shimafuji,kingfisher
|
||||
- enum:
|
||||
- renesas,h3ulcb
|
||||
- renesas,m3ulcb
|
||||
- renesas,m3nulcb
|
||||
- enum:
|
||||
- renesas,r8a7795
|
||||
- renesas,r8a7796
|
||||
- renesas,r8a77961
|
||||
- renesas,r8a77965
|
||||
- items:
|
||||
- const: shimafuji,kingfisher
|
||||
- enum:
|
||||
- renesas,h3ulcb
|
||||
- renesas,m3ulcb
|
||||
- enum:
|
||||
- renesas,r8a779m1
|
||||
- renesas,r8a779m3
|
||||
- enum:
|
||||
- renesas,r8a7795
|
||||
- renesas,r8a77961
|
||||
|
||||
- description: R-Car M3-N (R8A77965)
|
||||
items:
|
||||
@ -296,6 +308,22 @@ properties:
|
||||
- const: renesas,falcon-cpu
|
||||
- const: renesas,r8a779a0
|
||||
|
||||
- description: R-Car H3e-2G (R8A779M1)
|
||||
items:
|
||||
- enum:
|
||||
- renesas,h3ulcb # H3ULCB (R-Car Starter Kit Premier)
|
||||
- renesas,salvator-xs # Salvator-XS (Salvator-X 2nd version)
|
||||
- const: renesas,r8a779m1
|
||||
- const: renesas,r8a7795
|
||||
|
||||
- description: R-Car M3e-2G (R8A779M3)
|
||||
items:
|
||||
- enum:
|
||||
- renesas,m3ulcb # M3ULCB (R-Car Starter Kit Pro)
|
||||
- renesas,salvator-xs # Salvator-XS (Salvator-X 2nd version)
|
||||
- const: renesas,r8a779m3
|
||||
- const: renesas,r8a77961
|
||||
|
||||
- description: RZ/N1D (R9A06G032)
|
||||
items:
|
||||
- enum:
|
||||
|
@ -54,7 +54,7 @@ properties:
|
||||
- const: toradex,apalis_t30
|
||||
- const: nvidia,tegra30
|
||||
- items:
|
||||
- const: toradex,apalis_t30-eval-v1.1
|
||||
- const: toradex,apalis_t30-v1.1-eval
|
||||
- const: toradex,apalis_t30-eval
|
||||
- const: toradex,apalis_t30-v1.1
|
||||
- const: toradex,apalis_t30
|
||||
@ -111,6 +111,7 @@ properties:
|
||||
- items:
|
||||
- enum:
|
||||
- nvidia,p2771-0000
|
||||
- nvidia,p3509-0000+p3636-0001
|
||||
- const: nvidia,tegra186
|
||||
- items:
|
||||
- enum:
|
||||
|
@ -1,30 +0,0 @@
|
||||
* Samsung AHCI SATA Controller
|
||||
|
||||
SATA nodes are defined to describe on-chip Serial ATA controllers.
|
||||
Each SATA controller should have its own node.
|
||||
|
||||
Required properties:
|
||||
- compatible : compatible list, contains "samsung,exynos5-sata"
|
||||
- interrupts : <interrupt mapping for SATA IRQ>
|
||||
- reg : <registers mapping>
|
||||
- samsung,sata-freq : <frequency in MHz>
|
||||
- phys : Must contain exactly one entry as specified
|
||||
in phy-bindings.txt
|
||||
- phy-names : Must be "sata-phy"
|
||||
|
||||
Optional properties:
|
||||
- clocks : Must contain an entry for each entry in clock-names.
|
||||
- clock-names : Shall be "sata" for the external SATA bus clock,
|
||||
and "sclk_sata" for the internal controller clock.
|
||||
|
||||
Example:
|
||||
sata@122f0000 {
|
||||
compatible = "snps,dwc-ahci";
|
||||
samsung,sata-freq = <66>;
|
||||
reg = <0x122f0000 0x1ff>;
|
||||
interrupts = <0 115 0>;
|
||||
clocks = <&clock 277>, <&clock 143>;
|
||||
clock-names = "sata", "sclk_sata";
|
||||
phys = <&sata_phy>;
|
||||
phy-names = "sata-phy";
|
||||
};
|
Some files were not shown because too many files have changed in this diff Show More
Loading…
Reference in New Issue
Block a user