[PATCH] RAS/CEC: Reduce offline page threshold for Intel systems

From: Tony Luck
Date: Fri Jul 01 2022 - 15:13:00 EST


A large-scale study of memory errors on Intel systems in data centers
showed that aggressively taking pages with corrected errors offline is
the most effective way to use corrected errors as a predictor of future
uncorrected errors.

It is unknown whether this would help other vendors. There are some
indicators that it would not.

Set action_threshold to 2 on Intel systems.

Do-not-apply-without-agreement-from-AMD
Signed-off-by: Tony Luck <tony.luck@xxxxxxxxx>
---
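For readers unfamiliar with how a count threshold drives page offlining,
here is a minimal user-space sketch. It is not the code in
drivers/ras/cec.c: the names ce_table, ce_entry, ce_record and
MAX_TRACKED are hypothetical, the decay and eviction logic of the real
collector is left out, and it assumes a page is taken offline as soon as
its corrected-error count reaches the threshold.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 8	/* hypothetical table size, for illustration only */

struct ce_entry {
	uint64_t pfn;		/* page frame number that saw corrected errors */
	unsigned int count;	/* corrected errors recorded for this page */
	bool used;
};

struct ce_table {
	struct ce_entry entries[MAX_TRACKED];
	unsigned int threshold;	/* plays the role of action_threshold */
};

/*
 * Record one corrected error for @pfn.
 * Returns true when the count has reached the threshold, meaning the
 * caller should take the page offline. The entry is dropped at that point.
 * Assumption: a full table simply ignores new pages.
 */
static bool ce_record(struct ce_table *t, uint64_t pfn)
{
	struct ce_entry *slot = NULL;
	size_t i;

	for (i = 0; i < MAX_TRACKED; i++) {
		struct ce_entry *e = &t->entries[i];

		if (e->used && e->pfn == pfn) {
			slot = e;	/* page is already tracked */
			break;
		}
		if (!e->used && !slot)
			slot = e;	/* remember the first free slot */
	}

	if (!slot)
		return false;		/* table full, error not tracked */

	if (!slot->used) {
		slot->used = true;
		slot->pfn = pfn;
		slot->count = 0;
	}

	if (++slot->count < t->threshold)
		return false;

	slot->used = false;		/* stop tracking the offlined page */
	return true;
}
```

In this model, with threshold set to 2, the first call to ce_record()
for a given pfn returns false and the second returns true. A larger
threshold delays that point accordingly.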
 drivers/ras/cec.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index 42f2fc0bc8a9..b1fc193b2036 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -556,6 +556,14 @@ static int __init cec_init(void)
 	if (ce_arr.disabled)
 		return -ENODEV;
 
+	/*
+	 * Intel systems may avoid uncorrectable errors
+	 * if pages with corrected errors are aggressively
+	 * taken offline.
+	 */
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+		action_threshold = 2;
+
 	ce_arr.array = (void *)get_zeroed_page(GFP_KERNEL);
 	if (!ce_arr.array) {
 		pr_err("Error allocating CE array page!\n");
--
2.35.3