bf0cf32191
Changeset 62fdac5913
("x86, mce: Add boot options for corrected errors")
added three more MCE files that are also exposed currently via
sysfs.
Document them.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/f9c13aff3f5a2cae0d4495eccd17f431f3a4c55a.1632994837.git.mchehab+huawei@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
130 lines
4.4 KiB
Plaintext
130 lines
4.4 KiB
Plaintext
What: /sys/devices/system/machinecheck/machinecheckX/
|
|
Contact: Andi Kleen <ak@linux.intel.com>
|
|
Date: Feb, 2007
|
|
Description:
|
|
(X = CPU number)
|
|
|
|
Machine checks report internal hardware error conditions
|
|
detected by the CPU. Uncorrected errors typically cause a
|
|
machine check (often with panic), corrected ones cause a
|
|
machine check log entry.
|
|
|
|
For more details about the x86 machine check architecture
|
|
see the Intel and AMD architecture manuals from their
|
|
developer websites.
|
|
|
|
For more details about the architecture
|
|
see http://one.firstfloor.org/~andi/mce.pdf
|
|
|
|
Each CPU has its own directory.
|
|
|
|
What: /sys/devices/system/machinecheck/machinecheckX/bank<Y>
|
|
Contact: Andi Kleen <ak@linux.intel.com>
|
|
Date: Feb, 2007
|
|
Description:
|
|
(Y bank number)
|
|
|
|
64bit Hex bitmask enabling/disabling specific subevents for
|
|
bank Y.
|
|
|
|
When a bit in the bitmask is zero then the respective
|
|
subevent will not be reported.
|
|
|
|
By default all events are enabled.
|
|
|
|
Note that BIOS maintain another mask to disable specific events
|
|
per bank. This is not visible here
|
|
|
|
What: /sys/devices/system/machinecheck/machinecheckX/check_interval
|
|
Contact: Andi Kleen <ak@linux.intel.com>
|
|
Date: Feb, 2007
|
|
Description:
|
|
The entries appear for each CPU, but they are truly shared
|
|
between all CPUs.
|
|
|
|
How often to poll for corrected machine check errors, in
|
|
seconds (Note output is hexadecimal). Default 5 minutes.
|
|
When the poller finds MCEs it triggers an exponential speedup
|
|
(poll more often) on the polling interval. When the poller
|
|
stops finding MCEs, it triggers an exponential backoff
|
|
(poll less often) on the polling interval. The check_interval
|
|
variable is both the initial and maximum polling interval.
|
|
0 means no polling for corrected machine check errors
|
|
(but some corrected errors might be still reported
|
|
in other ways)
|
|
|
|
What: /sys/devices/system/machinecheck/machinecheckX/tolerant
|
|
Contact: Andi Kleen <ak@linux.intel.com>
|
|
Date: Feb, 2007
|
|
Description:
|
|
The entries appear for each CPU, but they are truly shared
|
|
between all CPUs.
|
|
|
|
Tolerance level. When a machine check exception occurs for a
|
|
non corrected machine check the kernel can take different
|
|
actions.
|
|
|
|
Since machine check exceptions can happen any time it is
|
|
sometimes risky for the kernel to kill a process because it
|
|
defies normal kernel locking rules. The tolerance level
|
|
configures how hard the kernel tries to recover even at some
|
|
risk of deadlock. Higher tolerant values trade potentially
|
|
better uptime with the risk of a crash or even corruption
|
|
(for tolerant >= 3).
|
|
|
|
== ===========================================================
|
|
0 always panic on uncorrected errors, log corrected errors
|
|
1 panic or SIGBUS on uncorrected errors, log corrected errors
|
|
2 SIGBUS or log uncorrected errors, log corrected errors
|
|
3 never panic or SIGBUS, log all errors (for testing only)
|
|
== ===========================================================
|
|
|
|
Default: 1
|
|
|
|
Note this only makes a difference if the CPU allows recovery
|
|
from a machine check exception. Current x86 CPUs generally
|
|
do not.
|
|
|
|
What: /sys/devices/system/machinecheck/machinecheckX/trigger
|
|
Contact: Andi Kleen <ak@linux.intel.com>
|
|
Date: Feb, 2007
|
|
Description:
|
|
The entries appear for each CPU, but they are truly shared
|
|
between all CPUs.
|
|
|
|
Program to run when a machine check event is detected.
|
|
This is an alternative to running mcelog regularly from cron
|
|
and allows to detect events faster.
|
|
|
|
What: /sys/devices/system/machinecheck/machinecheckX/monarch_timeout
|
|
Contact: Andi Kleen <ak@linux.intel.com>
|
|
Date: Feb, 2007
|
|
Description:
|
|
How long to wait for the other CPUs to machine check too on a
|
|
exception. 0 to disable waiting for other CPUs.
|
|
|
|
Unit: us
|
|
|
|
What: /sys/devices/system/machinecheck/machinecheckX/ignore_ce
|
|
Contact: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
|
|
Date: Jun 2009
|
|
Description:
|
|
Disables polling and CMCI for corrected errors.
|
|
All corrected events are not cleared and kept in bank MSRs.
|
|
|
|
What: /sys/devices/system/machinecheck/machinecheckX/dont_log_ce
|
|
Contact: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
|
|
Date: Jun 2009
|
|
Description:
|
|
Disables logging for corrected errors.
|
|
All reported corrected errors will be cleared silently.
|
|
|
|
This option will be useful if you never care about corrected
|
|
errors.
|
|
|
|
What: /sys/devices/system/machinecheck/machinecheckX/cmci_disabled
|
|
Contact: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
|
|
Date: Jun 2009
|
|
Description:
|
|
Disables the CMCI feature.
|