Re: Hardware Error Kernel Mini-Summit

From: Ingo Molnar
Date: Tue May 18 2010 - 18:30:06 EST



* Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:

> > [...]
> >
> > Concerning critical errors, there we bypass the perf
> > subsystem and execute the smallest amount of code
> > possible while trying to shut down gracefully if the
> > error type allows that.
> >
> > These are the rough ideas at least...
>
> Can someone please tell me why everyone is eager to
> squirrel correctable error reports away and not report
> them in dmesg? aka syslog.
>
> I have had on several occasions a machine with memory
> errors that mcelog or the BIOS was eating the error
> reports and not putting them anywhere a normal human
> being would look.

That's possible too - the TRACE_EVENT() of MCE events,
beyond the record format, also includes a human-readable
ASCII output format string:

# tail -1 /debug/tracing/events/mce/mce_record/format

print fmt: "CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx,
ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx,
PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x",
REC->cpu, REC->mcgcap, REC->mcgstatus, REC->bank,
REC->status, REC->addr, REC->misc, REC->cs, REC->ip,
REC->tsc, REC->cpuvendor, REC->cpuid, REC->walltime,
REC->socketid, REC->apicid

That format string could be used to printk the events too.
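
As a rough illustration of that idea, here is a minimal
userspace sketch that renders one record into a single
human-readable line using the same layout as the print fmt
above. The struct mce_rec type, its field types and the
mce_format() helper are hypothetical names made up for this
example; they are not the kernel's actual record layout or
API. The kernel-style %016Lx is written as %016llx, and %d
as %u, to stay within standard C for unsigned fields.

```c
#include <stdio.h>

/*
 * Hypothetical mirror of the fields named in the print fmt.
 * Field types are assumptions for this sketch only.
 */
struct mce_rec {
	unsigned int cpu;
	unsigned long long mcgcap, mcgstatus;
	unsigned int bank;
	unsigned long long status, addr, misc;
	unsigned int cs;
	unsigned long long ip, tsc;
	unsigned int cpuvendor, cpuid;
	unsigned long long walltime;
	unsigned int socketid, apicid;
};

/*
 * Format one record into buf. Returns what snprintf() returns:
 * the length the full line needs, excluding the trailing NUL.
 */
static int mce_format(char *buf, size_t len, const struct mce_rec *r)
{
	return snprintf(buf, len,
		"CPU: %u, MCGc/s: %llx/%llx, MC%u: %016llx, "
		"ADDR/MISC: %016llx/%016llx, RIP: %02x:<%016llx>, "
		"TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, "
		"SOCKET: %u, APIC: %x",
		r->cpu, r->mcgcap, r->mcgstatus, r->bank,
		r->status, r->addr, r->misc, r->cs, r->ip,
		r->tsc, r->cpuvendor, r->cpuid, r->walltime,
		r->socketid, r->apicid);
}
```

In the kernel the same argument list could presumably be
handed to printk() with a suitable log level instead of
snprintf(), so that the line ends up in dmesg.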

Cheers,

Ingo