Re: [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM)

From: Mauro Carvalho Chehab
Date: Sat Mar 26 2011 - 07:57:22 EST


On 25-03-2011 19:37, Tony Luck wrote:
> On Fri, Mar 25, 2011 at 2:22 PM, Mauro Carvalho Chehab
> <mchehab@xxxxxxxxxx> wrote:
>> On 25-03-2011 11:13, Borislav Petkov wrote:
>>> However, there's
>>> another issue with fatal errors - you want to execute as little code as
>>> possible in the wake of a fatal error.
>>
>> Yes. That's one of the reasons why it may make sense to have a separate event
>> for fatal errors.
>
> We have three categories (severities):
> 1) Corrected - log these
> 2) Uncorrected-but-not-immediately-fatal - log these too
> 3) Fatal - all we can do with these is log to some persistent store (or
> to a serial console connected to a logging device). perf style event
> tracing doesn't help when all the userland daemons will never get a
> chance to run.

OK. Assuming that fatal errors are stored in some persistent way, the daemon will
be able to catch them on the next boot. So I think it would be a nice feature to
have three different trace events, so that users can filter among them.
Alternatively, we could implement the filtering in userspace, but as perf already
has this, I'm in favor of using what's there.
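
To make the three-event idea concrete, here is a minimal userspace sketch, not
taken from the posted patch. The enum, the event names (hw_event_corrected and
so on) and the bitmask filter are all hypothetical; they only illustrate one
event per severity plus a consumer that subscribes to a subset of them:

```c
/* The three severities listed earlier in this thread. */
enum hw_severity {
	HW_SEV_CORRECTED,
	HW_SEV_UNCORRECTED,	/* uncorrected, but not immediately fatal */
	HW_SEV_FATAL,
};

/* Hypothetical event names, one per severity (not from the patch). */
static const char *hw_event_name(enum hw_severity sev)
{
	switch (sev) {
	case HW_SEV_CORRECTED:
		return "hw_event_corrected";
	case HW_SEV_UNCORRECTED:
		return "hw_event_uncorrected";
	case HW_SEV_FATAL:
		return "hw_event_fatal";
	}
	return "hw_event_unknown";
}

/*
 * A consumer's filter is a bitmask with one bit per severity.
 * Returns 1 if the consumer asked for events of this severity, else 0.
 */
static int hw_event_wanted(unsigned int mask, enum hw_severity sev)
{
	return (mask >> sev) & 1u;
}
```

A daemon interested only in uncorrected and fatal errors would build its mask
as (1u << HW_SEV_UNCORRECTED) | (1u << HW_SEV_FATAL). With separate trace
events, the equivalent selection is done by enabling only those events, so no
extra filtering code is needed in userspace.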

>> It would be good to use some non-volatile ram for these. I was told that
>> APEI spec defines a way for that, but I'm not sure if low end machines would
>> be shipped with that.
>
> You are talking about ERST - and you are right, this is generally not going
> to be present on low-end machines. drivers/acpi/apei/erst.c was accepted
> in 2.6.35. My /dev/pstore changes are in the current merge for 2.6.39 (but
> currently only show dmesg traces to the user).

It makes sense to integrate it with perf, once we add a way there to recover
persistent data when the daemon starts.

>> Alternatively, edac could fill a translation table, and the decoding code at
>> mce would be just a table lookup routine (in order to speed up translation
>> in the case of fatal errors).
>
> Eventually the translation table should move above edac (to the drivers/ras/
> area that Borislav suggested earlier?) so that both mce and edac can access.
> I think we'll need this for some time as SMBIOS continues to disappoint
> me with its inaccuracies.

That makes sense to me. Currently, the translation table there covers only memory.

The /ras table needs to be generic enough to cover other types of translation, such
as translating the kernel's representation of a CPU into a CPU socket label, and a
PCI bus ID into a PCI slot number.
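
As a rough illustration of such a generic table, here is a sketch only. The
type names, the struct layout and the label strings below are hypothetical and
do not come from edac, mce or any posted patch. One table keyed by
(object type, kernel-side id) keeps the lookup a plain table scan, which also
fits the goal of running very little code on a fatal error:

```c
#include <stddef.h>

/* Kinds of hardware objects the table can translate (hypothetical). */
enum ras_obj_type {
	RAS_OBJ_MEMORY,
	RAS_OBJ_CPU,
	RAS_OBJ_PCI,
};

struct ras_label {
	enum ras_obj_type type;
	unsigned int id;	/* kernel-side identifier */
	const char *label;	/* label as printed on the board or chassis */
};

/* Example entries; the values are made up for illustration. */
static const struct ras_label ras_labels[] = {
	{ RAS_OBJ_MEMORY, 0,    "DIMM_A1" },
	{ RAS_OBJ_CPU,    0,    "CPU socket 0" },
	{ RAS_OBJ_CPU,    8,    "CPU socket 1" },
	{ RAS_OBJ_PCI,    0x03, "PCI slot 2" },
};

/* Return the label for (type, id), or NULL if the table has no entry. */
static const char *ras_lookup_label(enum ras_obj_type type, unsigned int id)
{
	size_t i;

	for (i = 0; i < sizeof(ras_labels) / sizeof(ras_labels[0]); i++) {
		if (ras_labels[i].type == type && ras_labels[i].id == id)
			return ras_labels[i].label;
	}
	return NULL;
}
```

In a real implementation the table would be filled at runtime by whichever
driver knows the mapping, rather than being a static array.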

Mauro.