Re: [RFC 0/9] mce recovery for Sandy Bridge server

From: Borislav Petkov
Date: Tue May 24 2011 - 04:14:47 EST


On Tue, May 24, 2011 at 05:40:23AM +0200, Ingo Molnar wrote:
> So we *really* want to promote this code to a higher level of abstraction.
> Everyone would benefit from doing that: Intel hardware error handling features
> would be enabled much more richly and i suspect they would also be *used* in a
> much more meaningful way - driving the hw cycle as well.

Absolutely agreed. The RAS architecture should look like this, IMHO:

I. Event collection: #MC handler and pollers, no queueing or buffering crap.

II. Pluggable and extensible filters which are
* per vendor
* configurable from userspace
* easily extensible
* decide whether action should be taken in the kernel or whether the error is
non-critical and should go to the RAS daemon

III. Error handling callback(s)
* also extensible
* also per vendor
* also configurable from userspace
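
To make the three stages above a bit more concrete, here is a minimal
userspace sketch in plain C. It is not kernel code and not taken from any
patch in this series: every name in it (struct ras_event, struct
ras_vendor_ops, ras_register(), ras_dispatch(), the demo_* callbacks and the
two counters) is made up for illustration. The collector (stage I) calls
ras_dispatch() directly with no intermediate queue, the per-vendor filter
(stage II) decides between in-kernel action and the RAS daemon, and the
per-vendor handler (stage III) runs only when the filter asks for it. The
"configurable from userspace" part is not modeled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error record; the fields are illustrative only. */
struct ras_event {
    int      vendor;       /* which vendor's ops should see this event */
    unsigned bank;         /* reporting bank, opaque to the core */
    uint64_t status;       /* raw status value, opaque to the core */
    bool     uncorrected;  /* true if hardware could not correct it */
};

/* Stage II result: where should the event go? */
enum ras_verdict { RAS_TO_DAEMON, RAS_HANDLE_IN_KERNEL };

typedef enum ras_verdict (*ras_filter_fn)(const struct ras_event *ev);
typedef void (*ras_handler_fn)(const struct ras_event *ev);

/* One pluggable filter/handler pair per vendor (stages II and III). */
struct ras_vendor_ops {
    int            vendor;
    ras_filter_fn  filter;
    ras_handler_fn handler;
};

#define RAS_MAX_VENDORS 4

static struct ras_vendor_ops ras_ops[RAS_MAX_VENDORS];
static size_t ras_nr_ops;

/* Counters standing in for "sent to daemon" and "acted on in kernel". */
unsigned long ras_daemon_events;
unsigned long ras_kernel_actions;

/* Register a vendor's callbacks; returns 0 on success, -1 if the table is full. */
int ras_register(const struct ras_vendor_ops *ops)
{
    assert(ops != NULL);
    if (ras_nr_ops >= RAS_MAX_VENDORS)
        return -1;
    ras_ops[ras_nr_ops++] = *ops;
    return 0;
}

/*
 * Called directly by the collector (stage I). Runs the matching vendor
 * filter and, if it asks for it, the vendor handler. Events with no
 * matching vendor, no filter, or no handler default to the daemon path.
 */
enum ras_verdict ras_dispatch(const struct ras_event *ev)
{
    for (size_t i = 0; i < ras_nr_ops; i++) {
        const struct ras_vendor_ops *ops = &ras_ops[i];

        if (ops->vendor != ev->vendor || !ops->filter)
            continue;

        if (ops->filter(ev) == RAS_HANDLE_IN_KERNEL && ops->handler) {
            ops->handler(ev);
            ras_kernel_actions++;
            return RAS_HANDLE_IN_KERNEL;
        }
        break;
    }
    ras_daemon_events++;
    return RAS_TO_DAEMON;
}

/* Example vendor filter: only uncorrected errors need in-kernel action. */
enum ras_verdict demo_filter(const struct ras_event *ev)
{
    return ev->uncorrected ? RAS_HANDLE_IN_KERNEL : RAS_TO_DAEMON;
}

/* Example vendor handler: a real one would do the actual recovery work. */
void demo_handler(const struct ras_event *ev)
{
    (void)ev;
}
```

Registering { .vendor = 1, .filter = demo_filter, .handler = demo_handler }
and feeding ras_dispatch() hand-built events is also roughly what an
injection path would do: synthetic events enter at the same point as real
ones, so filters and handlers get exercised without touching hardware.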

Advantages:
* reuse perf code - no need for ad-hoc new buffers and lockless thingies when we
have it all already

* easy code and even hw testing with perf inject or ras inject
** this also gives us the different per-vendor injection methods in a unified
way instead of interfaces in /sys or debugfs or mcelog or ...

* keep code design sane instead of letting it needlessly fiddle with
other parts of the kernel

* ...

Now I'd better go and put my patches where my mouth is :).

--
Regards/Gruss,
Boris.