Re: [NAK] Re: [PATCH -v2 9/9] ACPI, APEI, Generic Hardware Error Source POLL/IRQ/NMI notification type support

From: Borislav Petkov
Date: Tue Oct 26 2010 - 02:27:08 EST


On Mon, Oct 25, 2010 at 04:35:43PM -0700, Tony Luck wrote:
> On Mon, Oct 25, 2010 at 2:51 PM, Borislav Petkov <bp@xxxxxxxxx> wrote:
> > Concerning fatal errors, take a look at drivers/edac/mce_amd.(c|h) -
> > this is not in arch/x86/ and still decodes MCEs in the kernel. And it
> > works fine - it even helped in several cases where people simply read
> > their serial console/dmesg and didn't have to collect it first and run
> > it through some tool to understand which functional unit in the CPU is
> > mchecking.
>
> That looks neat ... but end-users seem to have some conflicting requirements
> here. Your users seem to like it but the LLNL folks at the S.F. meeting said
> that solutions that involved looking at console logs from thousands
> of machines in a cluster were not acceptable.
>
> I doubt very much if any end-user cares which unit *within* a cpu
> failed (their replaceable unit is the whole of the cpu). So much of
> your driver could be replaced with: printk("CPU%d is bad\n", cpu);

Yeah, nobody said this is finished. The next step is using perf
infrastructure to convey those decoded errors to userspace, say, to a
ras daemon or similar which can do all sorts of reporting, statistics,
policy decisions, injection, paint graphs, whatever...

I have already sent out two patchsets as RFCs and am working
on the third one, so we're getting there. Here's the most recent one:
http://kerneltrap.org/mailarchive/linux-kernel/2010/8/6/4603847

Also, I'm open to all suggestions on how to make it more usable and
user-friendly.

Thanks.

--
Regards/Gruss,
Boris.