Re: [PATCH] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on action required events

From: Shuai Xue
Date: Wed Nov 02 2022 - 03:07:40 EST




On 2022/10/29 1:08 AM, Rafael J. Wysocki wrote:
> On Thu, Oct 27, 2022 at 6:25 AM Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx> wrote:
>>
>> There are two major types of uncorrected error (UC):
>>
>> - Action Required: The error is detected and the processor has already
>> consumed the corrupted memory. The OS must take action (for example,
>> offline the failed page or kill the failing thread) to recover from
>> this uncorrectable error.
>>
>> - Action Optional: The error is detected outside of processor execution
>> context. Some data in memory is corrupted, but it has not been consumed.
>> The OS may optionally take action to recover from this uncorrectable error.
>>
>> On x86 platforms, these two types can easily be distinguished based on
>> the MCA bank. On arm64 platforms, however, the memory failure flags for
>> all UCs whose severity is GHES_SEV_RECOVERABLE are currently set to 0,
>> i.e. Action Optional.
>>
>> If a UC is detected by a background scrubber, it is clearly an Action
>> Optional error. Other errors should conservatively be treated as
>> Action Required.
>>
>> cper_sec_mem_err::error_type identifies the type of error that occurred
>> if CPER_MEM_VALID_ERROR_TYPE is set. So, set the memory failure flags to 0
>> for Scrub Uncorrected Error (type 14); otherwise, set them to
>> MF_ACTION_REQUIRED.
>>
>> Signed-off-by: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx>
>
> I need input from the APEI reviewers on this.
>
> Thanks!

Hi, Rafael,

Sorry, I missed this email. Thank you for your quick reply. Let's discuss
this with the reviewers.

Thank you.

Cheers,
Shuai


>
>> ---
>> drivers/acpi/apei/ghes.c | 10 ++++++++--
>> include/linux/cper.h | 3 +++
>> 2 files changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 80ad530583c9..6c03059cbfc6 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -474,8 +474,14 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>>  	if (sec_sev == GHES_SEV_CORRECTED &&
>>  	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
>>  		flags = MF_SOFT_OFFLINE;
>> -	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
>> -		flags = 0;
>> +	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE) {
>> +		if (mem_err->validation_bits & CPER_MEM_VALID_ERROR_TYPE)
>> +			flags = mem_err->error_type == CPER_MEM_SCRUB_UC ?
>> +					0 :
>> +					MF_ACTION_REQUIRED;
>> +		else
>> +			flags = MF_ACTION_REQUIRED;
>> +	}
>>
>>  	if (flags != -1)
>>  		return ghes_do_memory_failure(mem_err->physical_addr, flags);
>> diff --git a/include/linux/cper.h b/include/linux/cper.h
>> index eacb7dd7b3af..b77ab7636614 100644
>> --- a/include/linux/cper.h
>> +++ b/include/linux/cper.h
>> @@ -235,6 +235,9 @@ enum {
>> #define CPER_MEM_VALID_BANK_ADDRESS 0x100000
>> #define CPER_MEM_VALID_CHIP_ID 0x200000
>>
>> +#define CPER_MEM_SCRUB_CE 13
>> +#define CPER_MEM_SCRUB_UC 14
>> +
>> #define CPER_MEM_EXT_ROW_MASK 0x3
>> #define CPER_MEM_EXT_ROW_SHIFT 16
>>
>> --
>> 2.20.1.9.gb50a0d7
>>
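
For reviewers who want to check the decision table without the surrounding
driver code, the flag selection in the ghes.c hunk above can be sketched as a
standalone user-space function. This is only an illustration, not the kernel
code: every SKETCH_* name, the struct, and the function are hypothetical
stand-ins, and the numeric values other than the scrub UC type (14, from the
patch) are placeholders chosen so the sketch compiles on its own.

```c
#include <stdint.h>

/* Hypothetical stand-ins; not the definitions from the kernel headers. */
#define SKETCH_MEM_VALID_ERROR_TYPE	0x0001	/* stands in for CPER_MEM_VALID_ERROR_TYPE */
#define SKETCH_MEM_SCRUB_UC		14	/* Scrub Uncorrected Error, as in the patch */
#define SKETCH_MF_ACTION_REQUIRED	(1 << 1)	/* stands in for MF_ACTION_REQUIRED */

enum sketch_sev {
	SKETCH_SEV_NO,
	SKETCH_SEV_CORRECTED,
	SKETCH_SEV_RECOVERABLE,
	SKETCH_SEV_PANIC,
};

/* Only the two fields the decision depends on. */
struct sketch_mem_err {
	uint64_t validation_bits;
	uint8_t error_type;
};

/*
 * Returns the memory failure flags for a recoverable UC, or -1 when the
 * event is not a recoverable UC and this branch does not apply.
 */
int sketch_uc_flags(enum sketch_sev sev, enum sketch_sev sec_sev,
		    const struct sketch_mem_err *mem_err)
{
	if (sev != SKETCH_SEV_RECOVERABLE || sec_sev != SKETCH_SEV_RECOVERABLE)
		return -1;

	/* Only a scrubber-detected UC with a valid error type is Action Optional. */
	if ((mem_err->validation_bits & SKETCH_MEM_VALID_ERROR_TYPE) &&
	    mem_err->error_type == SKETCH_MEM_SCRUB_UC)
		return 0;

	/* Everything else, including an invalid error type, is Action Required. */
	return SKETCH_MF_ACTION_REQUIRED;
}
```

The sketch folds the nested if/else from the patch into a single test, which
gives the same result: 0 only when the error type field is valid and equals
the scrub UC type, and the action-required flag in every other recoverable
case.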