Re: Hardware Error Kernel Mini-Summit

From: Hidetoshi Seto
Date: Tue May 18 2010 - 02:54:31 EST


(2010/05/18 3:23), Mauro Carvalho Chehab wrote:
> During the last LF Collaboration Summit, we held a mini-summit [1],
> intended to improve hardware error detection in the kernel, currently
> provided by the MCE and EDAC subsystems.
>
> The idea of this mini-summit came up after Thomas Gleixner and Ingo
> Molnar suggested that edac and mce should converge into an error
> subsystem.
>
> I'm enclosing the minutes of the meeting, in order to allow them to be
> reviewed by other kernel hackers who are interested in the topic but
> unfortunately couldn't come to the meeting.
>
> Btw, during the meeting, it was decided that the EDAC ML could work
> better if moved to vger, so I'm copying here both the old and the new
> edac mailing lists.
>
> [1] http://events.linuxfoundation.org/lfcs2010/edac
>
> ---

Thank you very much for providing this report.

I agree that we should have a well-organized error subsystem that
covers all error sources in the system and that provides a simple
yet sufficiently powerful API for users. As one of the interested
absentees, I think I could be of some help to you (e.g. on x86
low-level code).

It might be off-topic here, but I'd like to point out that you missed
the PCIe AER subsystem, which handles hardware errors on PCIe devices
nowadays (it works well on ppc, x86 and so on).
Given that APEI also covers PCIe errors and that some systems can
map MC registers to PCI configuration space, I think there is no
way for the new error subsystem to ignore I/O device errors while
it cares about errors on CPU/memory and cooperates with APEI.


Thanks,
H.Seto
