Re: x86/mce merge, integration hiccup + crash, design thoughts

From: Tim Hockin
Date: Wed Jan 14 2009 - 14:33:06 EST


On Wed, Jan 14, 2009 at 10:05 AM, Andi Kleen <ak@xxxxxxxxxxxxxxx> wrote:
>
>>
>> From my point of view: a single, consistent, easy logging interface
>> for the kernel to send *structured data* about hardware/system events
>> and errors up to userspace.
>
> Which kinds of events were you thinking of?
>
> So far we managed by cramming some other CPU events like thermal
> trip into "pseudo banks" in struct mce. Admittedly it's not the
> most pretty solution in the world, but it worked.

Yeah, no offense, but that's horrible :)

Ideally, I'd like to see one generic conduit for all sorts of
events: polled and exception MCEs, thermal interrupts, MCE
threshold interrupts, EDAC polled errors, PCI-express errors, SATA
disk timeouts.
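
As a rough illustration of what "structured data" on one conduit
could look like from the userspace side, here is a minimal sketch.
Everything in it is an assumption made up for the example: the
16-byte header layout, the numeric source ids, and the encode() and
decode_stream() helpers. None of it is an existing kernel interface.

```python
import struct
from dataclasses import dataclass

# Hypothetical event sources, mirroring the kinds of events listed above.
# The numeric ids are invented for this sketch.
SOURCES = {
    1: "mce_exception",
    2: "mce_polled",
    3: "thermal",
    4: "mce_threshold",
    5: "edac",
    6: "pcie",
    7: "sata_timeout",
}

# Assumed fixed header: timestamp (ns), source id, severity, payload length.
# "<" = little-endian with no padding; Q = u64, H = u16, I = u32.
HEADER = struct.Struct("<QHHI")


@dataclass(frozen=True)
class Event:
    timestamp_ns: int
    source: str
    severity: int
    payload: bytes


def encode(timestamp_ns, source_id, severity, payload=b""):
    """Build one record: fixed header followed by a source-specific payload."""
    return HEADER.pack(timestamp_ns, source_id, severity, len(payload)) + payload


def decode_stream(buf):
    """Yield every complete Event in buf; stop at a truncated record."""
    offset = 0
    while offset + HEADER.size <= len(buf):
        ts, source_id, severity, length = HEADER.unpack_from(buf, offset)
        start = offset + HEADER.size
        end = start + length
        if end > len(buf):
            break  # partial record, wait for more data
        name = SOURCES.get(source_id, f"unknown({source_id})")
        yield Event(ts, name, severity, buf[start:end])
        offset = end
```

The point of the common header is that a consumer can timestamp,
classify, and skip over any record without understanding its payload,
so a new event source does not need a new interface.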

Now I know there are different conduits for some events - netlink
tells me about netif link up/down events, I think. I would settle for
a small number of interfaces. What I don't want is what we have today
- EVERYTHING has a different interface. Some are poll()-able. Some
have to be actively polled. Some need a daemon listening or else
messages are dropped. For some you have to parse logs. Puke.

Put it this way: Given a thousand machines, I want to gather,
collate, and correlate all these events. I want to be able to produce
a "life story" of sorts for a machine and for a data center. Once I
can do that, I can start to make predictive diagnoses more accurately,
and I can know how much these things actually COST us.
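
A minimal sketch of the gather/collate/correlate step, assuming every
source already delivers (timestamp, machine, source, detail) tuples
sorted by time. The tuple shape and the merge_feeds(), life_stories(),
and correlated() helpers are hypothetical names for this example only.

```python
import heapq
from collections import defaultdict


def merge_feeds(*feeds):
    """Merge per-source feeds, each sorted by timestamp, into one stream."""
    return heapq.merge(*feeds, key=lambda ev: ev[0])


def life_stories(events):
    """Group a time-ordered event stream into one timeline per machine."""
    stories = defaultdict(list)
    for ts, machine, source, detail in events:
        stories[machine].append((ts, source, detail))
    return dict(stories)


def correlated(story, first, then, window):
    """Find (t_first, t_then) pairs where a `then` event follows a
    `first` event on the same machine within `window` time units.
    Assumes `story` is sorted by timestamp."""
    pairs = []
    for i, (t1, src1, _) in enumerate(story):
        if src1 != first:
            continue
        for t2, src2, _ in story[i + 1:]:
            if t2 - t1 > window:
                break  # later events are even further away
            if src2 == then:
                pairs.append((t1, t2))
    return pairs
```

With a single record format, something like
correlated(story, "edac", "mce", 60) becomes a few lines of code
instead of a separate parser for each interface.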

Tim