RE: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

From: Shiju Jose
Date: Thu Sep 10 2020 - 11:32:41 EST


Hello Boris,

>-----Original Message-----
>From: Borislav Petkov [mailto:bp@xxxxxxxxx]
>Sent: 09 September 2020 13:02
>To: Shiju Jose <shiju.jose@xxxxxxxxxx>
>Cc: linux-edac@xxxxxxxxxxxxxxx; linux-acpi@xxxxxxxxxxxxxxx; linux-
>kernel@xxxxxxxxxxxxxxx; tony.luck@xxxxxxxxx; rjw@xxxxxxxxxxxxx;
>james.morse@xxxxxxx; lenb@xxxxxxxxxx; Linuxarm
><linuxarm@xxxxxxxxxx>
>Subject: Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate
>an erroneous CPU core
>
>On Tue, Sep 01, 2020 at 04:20:54PM +0000, Shiju Jose wrote:
>> CPU CEC derived the infrastructure of the CEC only and the logic used
>> in the CEC for CE count storage, CE count calculation and page
>> isolation is very unique for the memory pages, which seems cannot be
>> reusable for the CPU CEs.
>
>Oh, because it saves the reported error's PFN and you want to save
>
>[CPU num | error count]
>
>?
Yes.

>
>Well, you can easily change that by extending the existing CEC to have a
>different storage format for CPU errors, i.e., use a different ce_array which
>gets passed to the functions anyway.
Ok. However the functions such as __find_elem() use
memory specific PFN() and PAGE_SHIFT.

>
>> Also the values set for the parameters such as threshold, time period
>> for the memory errors and CPU errors would be different.
>
>And your implementation with sliding windows is so totally different that it
>warrants the duplication of the code? I don't think so.
>
>You can use the current CEC to do exactly what you wanna do, with the
>decaying and so on.
I will check this.
For CPU, the corrected errors count for a short time period to be checked.
Thus old errors outside this period would not be considered and would be cleared.
It is not clear to me whether in the current CEC, the count for the old errors outside
a time period would be excluded for the threshold check or removed?

>
>Because all you wanna do is count the errors a CPU triggered.
>
>However, a CPU can trigger a *lot* of different types of errors.
>You're putting them all in the same basket by doing:
>
> else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM))
> /* add to CEC */
>
>and only for correctable.
>
>What type of errors get reported in CPER_SEC_PROC_ARM?
According to the ARM Processor CPER definition the error types reported are
Cache Error, TLB Error, Bus Error and micro-architectural Error.

>
>If they're all lumped together and if some functional unit generates a lot of
>errors, instead of disabling that unit only, you'll go and remove the whole
>CPU?
>
Few thoughts on this,
1. Not sure will a CPU core would work/perform as normal after disabling
a functional unit?
2. Support in the HW to disable a function unit alone may not available.
3. If it is require to store and retrieve the error count based on functional unit,
then CEC will become more complex?

>Doesn't make a whole lot of sense to me.
>
>How about you define what exactly you're trying to solve, maybe give an
>example of a real issue someone is encountering and you're trying to
>address? Because there was never a necessity so far to disable CPUs on
>x86 due to correctable errors. Why is that needed on ARM?
>
This requirement is the part of the early fault prediction by taking action
when large number of corrected errors reported on a CPU core
before it causing serious faults.
We are mainly looking for disable CPU core on large number of L1/L2 cache
corrected errors reported on a CPU core. Can we add atleast removing CPU core
for the CPU cache corrected errors filtering out other error types?

[...]

Thanks,
Shiju