Re: [PATCH 1/2] ghes_edac: refactor memory error location processing

From: Robert Richter
Date: Tue Dec 07 2021 - 06:31:08 EST


On 07.12.21 11:19:04, Shuai Xue wrote:
> The memory error location processing in ghes_edac_report_mem_error() have
> Duplicated Code with cper_mem_err_location(), cper_dimm_err_location(), and
> cper_mem_err_type_str() in drivers/firmware/efi/cper.c.
>
> To avoid the duplicated code, this patch introduces the above cper_*() into
> ghes_edac_report_mem_error().

It is not really a duplicate yet: the output of the two versions
differs slightly, which could cause problems for some log parsers. At
least those differences should be listed in the patch description. I
would rather drop the space after the colon and keep the ghes version
of the format, as the extra spaces make the logs harder to read. So
ideally there is a unification patch before the "duplication" is
removed, with changes in both files as necessary, for review and to
document the change.
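
To illustrate the parser concern, here is a minimal userspace sketch.
The helper names are hypothetical and this is not the kernel code; it
only mimics the two delimiter styles visible in the log quoted below
("node: 0" versus "page:0x898b86"). A parser that splits on spaces
sees a different number of tokens depending on the style:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: cper.c-like style, space after the colon. */
int fmt_cper_style(char *buf, size_t len, int node, int card)
{
	return snprintf(buf, len, "node: %d card: %d", node, card);
}

/* Hypothetical helper: ghes_edac-like style, no space after the colon. */
int fmt_ghes_style(char *buf, size_t len, int node, int card)
{
	return snprintf(buf, len, "node:%d card:%d", node, card);
}

/* Count space-separated tokens, as a naive log parser might. */
int count_tokens(const char *s)
{
	int n = 0, in_tok = 0;

	for (; *s; s++) {
		if (*s == ' ') {
			in_tok = 0;
		} else if (!in_tok) {
			in_tok = 1;
			n++;
		}
	}
	return n;
}
```

With the ghes-like style every token is a complete key:value pair,
while the cper-like style splits each pair into two tokens, so a
parser written against one format would misread the other.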

>
> The EDAC error log is now properly reporting the error as follows (all
> Validation Bits are enabled):
>
> [ 375.938411] EDAC MC0: 1 CE single-symbol chipkill ECC on unknown memory (node: 0 card: 0 module: 0 rank: 0 bank: 513 bank_group: 2 bank_address: 1 device: 0 row: 4887 column: 1032 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 DIMM location: not present. DMI handle: 0x0000 page:0x898b86 offset:0x20 grain:1 syndrome:0x0 - APEI location: node: 0 card: 0 module: 0 rank: 0 bank: 513 bank_group: 2 bank_address: 1 device: 0 row: 4887 column: 1032 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000 DIMM location: not present. DMI handle: 0x0000 status(0x0000000000000000): reserved)
> [ 375.938416] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
> [ 375.938417] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
> [ 375.938418] {2}[Hardware Error]: event severity: corrected
> [ 375.938419] {2}[Hardware Error]: Error 0, type: corrected
> [ 375.938420] {2}[Hardware Error]: section_type: memory error
> [ 375.938421] {2}[Hardware Error]: error_status: 0x0000000000000000
> [ 375.938422] {2}[Hardware Error]: physical_address: 0x0000000898b86020
> [ 375.938422] {2}[Hardware Error]: physical_address_mask: 0x0000000000000000
> [ 375.938426] {2}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 513 bank_group: 2 bank_address: 1 device: 0 row: 4887 column: 1032 bit_position: 0 requestor_id: 0x0000000000000000 responder_id: 0x0000000000000000
> [ 375.938426] {2}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
> [ 375.938428] {2}[Hardware Error]: DIMM location: not present. DMI handle: 0x0000
>
> Signed-off-by: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx>


> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index 6ec8edec6329..08eabb2e23f8 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -211,7 +211,7 @@ const char *cper_mem_err_type_str(unsigned int etype)
> }
> EXPORT_SYMBOL_GPL(cper_mem_err_type_str);
>
> -static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
> +int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
> {
> u32 len, n;
>
> @@ -265,7 +265,7 @@ static int cper_mem_err_location(struct cper_mem_err_compact *mem, char *msg)
> return n;
> }
>
> -static int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
> +int cper_dimm_err_location(struct cper_mem_err_compact *mem, char *msg)
> {
> u32 len, n;
> const char *bank = NULL, *device = NULL;

Even though the ghes driver cannot be built as a module,
EXPORT_SYMBOL_GPL()s should be added for both functions.

It would be good to add a note to the description that the
UEFI_CPER/EDAC_GHES dependency is always satisfied through
ACPI_APEI_GHES/ACPI_APEI. But we should make the UEFI_CPER dependency
explicit for EDAC_GHES in Kconfig anyway.
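
As a rough sketch of what the explicit dependency could look like
(the rest of the EDAC_GHES entry is elided, and using "select" rather
than "depends on UEFI_CPER" is only an assumption about the preferred
form):

```
config EDAC_GHES
	# existing prompt and "depends on" lines unchanged
	select UEFI_CPER
```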

-Robert