Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

From: Borislav Petkov
Date: Tue Sep 01 2020 - 10:36:12 EST


On Tue, Sep 01, 2020 at 03:01:40PM +0100, Shiju Jose wrote:
> When CPU correctable errors are reported on an ARM64 CPU core too often,
> the core should be isolated. Add the CPU correctable error collector to
> store the CPU correctable error count.
>
> When the correctable error count for a CPU exceeds the threshold
> value within a short time period, it will try to isolate the CPU core.
> The threshold value, time period, etc. are configurable.
>
> Implementation details are added in the file.
>
> Signed-off-by: Shiju Jose <shiju.jose@xxxxxxxxxx>
> ---
> Documentation/ABI/testing/debugfs-cpu-cec | 22 ++
> arch/arm64/ras/Kconfig | 8 +
> drivers/acpi/apei/ghes.c | 30 +-
> drivers/ras/Kconfig | 1 +
> drivers/ras/Makefile | 1 +
> drivers/ras/cpu_cec.c | 393 ++++++++++++++++++++++
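
For readers following the thread, the threshold behaviour described in the
quoted commit message (count correctable errors per CPU, and try to isolate
the core once the count exceeds a threshold within a time period) could be
sketched roughly as below. This is an illustrative user-space sketch under
stated assumptions, not code from the patch or from the existing CEC: the
names ce_window, ce_state, ce_record and NR_CPUS_SKETCH are hypothetical, a
simple fixed window stands in for whatever decay scheme the patch uses, and
the actual isolation step is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical CPU count for the sketch; not a kernel constant. */
#define NR_CPUS_SKETCH 4

/* Per-CPU state: start of the current counting window and errors seen in it. */
struct ce_window {
	uint64_t window_start;	/* seconds, caller-supplied clock */
	unsigned int count;	/* correctable errors in this window */
};

static struct ce_window ce_state[NR_CPUS_SKETCH];

/*
 * Record one correctable error for @cpu at time @now.
 * Returns true when the count in the current window exceeds @threshold,
 * i.e. when the caller should try to isolate the core.
 * @threshold and @period are assumed to be the configurable values the
 * commit message mentions.
 */
static bool ce_record(unsigned int cpu, uint64_t now,
		      unsigned int threshold, uint64_t period)
{
	struct ce_window *w;

	if (cpu >= NR_CPUS_SKETCH)
		return false;

	w = &ce_state[cpu];

	/* Window expired: start a new one and forget old errors. */
	if (now - w->window_start >= period) {
		w->window_start = now;
		w->count = 0;
	}

	w->count++;
	return w->count > threshold;
}
```

With a threshold of 2 and a period of 10 seconds, the third error on the same
CPU inside one window would trip the check, while errors spread across
windows would not.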

So instead of adding the ability to collect other error types to the
CEC, you're duplicating the CEC itself?!

Why?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette