[PATCH v2 4/7] ABI: sysfs-mce: add a new ABI file

From: Mauro Carvalho Chehab
Date: Thu Sep 30 2021 - 05:45:06 EST


Reduce the gap of missing ABIs for Intel servers with MCE
by adding a new ABI file.

The contents of this file come from:
Documentation/x86/x86_64/machinecheck.rst

Cc: Andi Kleen <ak@xxxxxxxxxxxxxxx>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx>
---

See [PATCH v2 0/7] at: https://lore.kernel.org/all/cover.1632994837.git.mchehab+huawei@xxxxxxxxxx/
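
Not part of the patch: for reviewers who want to poke at the documented
entries, here is a minimal Python sketch of how the values described in the
new ABI file could be read. Only the sysfs paths, the tolerant table and the
"output is hexadecimal" remark come from the text being moved; the helper
names (parse_check_interval, subevent_enabled, describe_tolerant, read_attr)
are hypothetical, and parsing check_interval as hex is an assumption taken
from the description that should be verified against the running kernel.

```python
from pathlib import Path

# Directory documented in the ABI file; one machinecheckX subdirectory per CPU.
MCE_DIR = Path("/sys/devices/system/machinecheck")

# Tolerance levels, copied from the table in the "tolerant" description.
TOLERANT_LEVELS = {
    0: "always panic on uncorrected errors, log corrected errors",
    1: "panic or SIGBUS on uncorrected errors, log corrected errors",
    2: "SIGBUS or log uncorrected errors, log corrected errors",
    3: "never panic or SIGBUS, log all errors (for testing only)",
}


def parse_check_interval(text):
    """Return the polling interval in seconds.

    Assumption: the value is printed in hexadecimal, as the description says.
    """
    return int(text.strip(), 16)


def subevent_enabled(mask_text, bit):
    """Return True if the given bit is set in a bankY hex bitmask."""
    return bool((int(mask_text.strip(), 16) >> bit) & 1)


def describe_tolerant(level):
    """Map a tolerant level to its documented meaning."""
    return TOLERANT_LEVELS.get(level, "unknown tolerance level")


def read_attr(cpu, name):
    """Read one attribute of machinecheck<cpu> (hypothetical helper)."""
    return (MCE_DIR / f"machinecheck{cpu}" / name).read_text()


if __name__ == "__main__":
    # Only touch sysfs when the directory exists (x86 with MCE support).
    if (MCE_DIR / "machinecheck0").is_dir():
        seconds = parse_check_interval(read_attr(0, "check_interval"))
        print(f"check_interval: {seconds} s")
    else:
        print("no machinecheck sysfs directory on this system")
```

Since check_interval, tolerant and trigger are shared between all CPUs,
reading them from machinecheck0 should be enough in this sketch.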

Documentation/ABI/testing/sysfs-mce | 107 ++++++++++++++++++++++
Documentation/x86/x86_64/machinecheck.rst | 56 +----------
MAINTAINERS | 2 +
3 files changed, 111 insertions(+), 54 deletions(-)
create mode 100644 Documentation/ABI/testing/sysfs-mce

diff --git a/Documentation/ABI/testing/sysfs-mce b/Documentation/ABI/testing/sysfs-mce
new file mode 100644
index 000000000000..686fbfa02cdc
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-mce
@@ -0,0 +1,107 @@
+What: /sys/devices/system/machinecheck/machinecheckX/
+Contact: Andi Kleen <ak@xxxxxxxxxxxxxxx>
+Date: Feb, 2007
+Description:
+ (X = CPU number)
+
+ Machine checks report internal hardware error conditions
+ detected by the CPU. Uncorrected errors typically cause a
+ machine check (often with panic), corrected ones cause a
+ machine check log entry.
+
+ For more details about the x86 machine check architecture
+ see the Intel and AMD architecture manuals from their
+ developer websites.
+
+ For more details about the architecture
+ see http://one.firstfloor.org/~andi/mce.pdf
+
+ Each CPU has its own directory.
+
+What: /sys/devices/system/machinecheck/machinecheckX/bank<Y>
+Contact: Andi Kleen <ak@xxxxxxxxxxxxxxx>
+Date: Feb, 2007
+Description:
+ (Y = bank number)
+
+ 64-bit hex bitmask enabling/disabling specific subevents for
+ bank Y.
+
+ When a bit in the bitmask is zero then the respective
+ subevent will not be reported.
+
+ By default all events are enabled.
+
+ Note that the BIOS maintains another mask to disable specific events
+ per bank. This is not visible here.
+
+What: /sys/devices/system/machinecheck/machinecheckX/check_interval
+Contact: Andi Kleen <ak@xxxxxxxxxxxxxxx>
+Date: Feb, 2007
+Description:
+ The entries appear for each CPU, but they are truly shared
+ between all CPUs.
+
+ How often to poll for corrected machine check errors, in
+ seconds (Note output is hexadecimal). Default 5 minutes.
+ When the poller finds MCEs it triggers an exponential speedup
+ (poll more often) on the polling interval. When the poller
+ stops finding MCEs, it triggers an exponential backoff
+ (poll less often) on the polling interval. The check_interval
+ variable is both the initial and maximum polling interval.
+ 0 means no polling for corrected machine check errors
+ (but some corrected errors might still be reported
+ in other ways).
+
+What: /sys/devices/system/machinecheck/machinecheckX/tolerant
+Contact: Andi Kleen <ak@xxxxxxxxxxxxxxx>
+Date: Feb, 2007
+Description:
+ The entries appear for each CPU, but they are truly shared
+ between all CPUs.
+
+ Tolerance level. When a machine check exception occurs for a
+ uncorrected machine check, the kernel can take different
+ actions.
+
+ Since machine check exceptions can happen at any time, it is
+ sometimes risky for the kernel to kill a process because it
+ defies normal kernel locking rules. The tolerance level
+ configures how hard the kernel tries to recover even at some
+ risk of deadlock. Higher tolerant values trade potentially
+ better uptime against the risk of a crash or even corruption
+ (for tolerant >= 3).
+
+ == ===========================================================
+ 0 always panic on uncorrected errors, log corrected errors
+ 1 panic or SIGBUS on uncorrected errors, log corrected errors
+ 2 SIGBUS or log uncorrected errors, log corrected errors
+ 3 never panic or SIGBUS, log all errors (for testing only)
+ == ===========================================================
+
+ Default: 1
+
+ Note this only makes a difference if the CPU allows recovery
+ from a machine check exception. Current x86 CPUs generally
+ do not.
+
+What: /sys/devices/system/machinecheck/machinecheckX/trigger
+Contact: Andi Kleen <ak@xxxxxxxxxxxxxxx>
+Date: Feb, 2007
+Description:
+ The entries appear for each CPU, but they are truly shared
+ between all CPUs.
+
+ Program to run when a machine check event is detected.
+ This is an alternative to running mcelog regularly from cron
+ and allows events to be detected faster.
+
+What: /sys/devices/system/machinecheck/machinecheckX/monarch_timeout
+Contact: Andi Kleen <ak@xxxxxxxxxxxxxxx>
+Date: Feb, 2007
+Description:
+ How long to wait for the other CPUs to machine check too on an
+ exception. 0 to disable waiting for other CPUs.
+
+ Unit: us
+
diff --git a/Documentation/x86/x86_64/machinecheck.rst b/Documentation/x86/x86_64/machinecheck.rst
index b402e04bee60..cea12ee97200 100644
--- a/Documentation/x86/x86_64/machinecheck.rst
+++ b/Documentation/x86/x86_64/machinecheck.rst
@@ -21,60 +21,8 @@ from /dev/mcelog. Normally mcelog should be run regularly from a cronjob.
Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
(N = CPU number).

-The directory contains some configurable entries:
-
-bankNctl
- (N bank number)
-
- 64bit Hex bitmask enabling/disabling specific subevents for bank N
- When a bit in the bitmask is zero then the respective
- subevent will not be reported.
- By default all events are enabled.
- Note that BIOS maintain another mask to disable specific events
- per bank. This is not visible here
-
-The following entries appear for each CPU, but they are truly shared
-between all CPUs.
-
-check_interval
- How often to poll for corrected machine check errors, in seconds
- (Note output is hexadecimal). Default 5 minutes. When the poller
- finds MCEs it triggers an exponential speedup (poll more often) on
- the polling interval. When the poller stops finding MCEs, it
- triggers an exponential backoff (poll less often) on the polling
- interval. The check_interval variable is both the initial and
- maximum polling interval. 0 means no polling for corrected machine
- check errors (but some corrected errors might be still reported
- in other ways)
-
-tolerant
- Tolerance level. When a machine check exception occurs for a non
- corrected machine check the kernel can take different actions.
- Since machine check exceptions can happen any time it is sometimes
- risky for the kernel to kill a process because it defies
- normal kernel locking rules. The tolerance level configures
- how hard the kernel tries to recover even at some risk of
- deadlock. Higher tolerant values trade potentially better uptime
- with the risk of a crash or even corruption (for tolerant >= 3).
-
- 0: always panic on uncorrected errors, log corrected errors
- 1: panic or SIGBUS on uncorrected errors, log corrected errors
- 2: SIGBUS or log uncorrected errors, log corrected errors
- 3: never panic or SIGBUS, log all errors (for testing only)
-
- Default: 1
-
- Note this only makes a difference if the CPU allows recovery
- from a machine check exception. Current x86 CPUs generally do not.
-
-trigger
- Program to run when a machine check event is detected.
- This is an alternative to running mcelog regularly from cron
- and allows to detect events faster.
-monarch_timeout
- How long to wait for the other CPUs to machine check too on a
- exception. 0 to disable waiting for other CPUs.
- Unit: us
+The directory contains some configurable entries. See
+Documentation/ABI/testing/sysfs-mce for more details.

TBD document entries for AMD threshold interrupt configuration

diff --git a/MAINTAINERS b/MAINTAINERS
index e9fd362ef4d6..360311ea0b43 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -20457,6 +20457,8 @@ M: Tony Luck <tony.luck@xxxxxxxxx>
M: Borislav Petkov <bp@xxxxxxxxx>
L: linux-edac@xxxxxxxxxxxxxxx
S: Maintained
+F: Documentation/ABI/testing/sysfs-mce
+F: Documentation/x86/x86_64/machinecheck.rst
F: arch/x86/kernel/cpu/mce/*

X86 MICROCODE UPDATE SUPPORT
--
2.31.1