Re: [PATCH 3/3] mce: acpi/apei: trace: Enable ghes memory error traceevent

From: Mauro Carvalho Chehab
Date: Wed Aug 14 2013 - 20:05:47 EST


On Wed, 14 Aug 2013 07:43:22 +0200
Borislav Petkov <bp@xxxxxxxxx> wrote:

> On Tue, Aug 13, 2013 at 08:13:56PM +0000, Luck, Tony wrote:
> > Generic tracepoints are architected to be able to fire at very high
> > rates and log huge amounts of information. So we'd need something
> > special to say just log these special tracepoints to network/serial.
> >
> > > Which reminds me, pstore could also be a good thing to use, in addition.
> > > Only put error info there as it is limited anyway.
> >
> > Yes - space is very limited. I don't know how to assign priority for logging
> > the dmesg data vs. some error logs.
>
> Didn't we say at some point, "log only the panic message which kills
> the machine"?

The EDAC core allows that kind of thing, and can even panic when errors arrive:

$ modinfo edac_core
filename: /lib/modules/3.10.5-201.fc19.x86_64/kernel/drivers/edac/edac_core.ko
...
parm: edac_pci_panic_on_pe:Panic on PCI Bus Parity error: 0=off 1=on (int)
parm: edac_mc_panic_on_ue:Panic on uncorrected error: 0=off 1=on (int)
parm: edac_mc_log_ue:Log uncorrectable error to console: 0=off 1=on (int)
parm: edac_mc_log_ce:Log correctable error to console: 0=off 1=on (int)

Those have 0644 permissions, so root can change them at runtime.
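
As a rough sketch of what changing them at runtime could look like, assuming
the parameters are exposed in the usual sysfs location for module parameters
(/sys/module/edac_core/parameters). The helper names show_params and
set_param are made up for illustration; they are not part of EDAC:

```shell
#!/bin/sh
# Sketch only: inspect and change edac_core module parameters at runtime.
# Assumption: the parameters live under /sys/module/edac_core/parameters.
# PARAM_DIR may be overridden to point somewhere else.
PARAM_DIR="${PARAM_DIR:-/sys/module/edac_core/parameters}"

# Print every parameter found in PARAM_DIR as name=value.
show_params() {
    for f in "$PARAM_DIR"/*; do
        [ -f "$f" ] || continue
        printf '%s=%s\n' "$(basename "$f")" "$(cat "$f")"
    done
}

# Write a new value to one parameter (hypothetical helper).
# On a real system this needs root, since the files are only owner-writable.
set_param() {
    name=$1
    value=$2
    if [ ! -w "$PARAM_DIR/$name" ]; then
        echo "cannot write $PARAM_DIR/$name" >&2
        return 1
    fi
    printf '%s\n' "$value" > "$PARAM_DIR/$name"
}
```

With that, something like "set_param edac_mc_panic_on_ue 1" followed by
"show_params" would turn on panic-on-UE and list the current settings.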

Of course, there is room for improvement.

> However, we could probably make more use of the messages before that
> catastrophic event because they could give us hints about what led to
> the panic, but in that case maybe a limited pstore is the wrong logging
> medium.
>
> Actually, I can imagine the full serial/network logs of "special"
> tracepoints + dmesg to be the optimal thing.
>
> > If we just "printk()" the most important parts - then that data will
> > automatically flow to the serial console and to pstore.
>
> Actually, does the pstore act like a circular buffer? Because if it
> contains the last N relevant messages (for an arbitrary definition of
> relevant) before the system dies, then that could be more helpful than only
> the error messages.
>
> And with the advent of UEFI, pretty much every system has a pstore. Too
> bad that we have to limit it to 50% of size so that the boxes don't
> brick. :-P
>
> > Then we have multiple paths for the critical bits of the error log
> > - and the tracepoints give us more details for the cases where the
> > machine doesn't spontaneously explode.
>
> Ok, let's sort:
>
> * First we have the not-so-critical hw error messages. We want to carry
> those out-of-band, i.e. not in dmesg so that people don't have to parse
> and collect dmesg but have a specialized solution which gives them
> structured logs and tools can analyze, collect and ... those errors.
>
> * When a critical error happens, the above usage is not necessarily
> advantageous anymore in the sense that, in order to debug what caused
> the machine to crash, we don't necessarily want only the crash
> message but also the whole system activity that led to it.
>
> In which case, we probably actually want to turn off/ignore the error
> logging tracepoints and write *only* to dmesg which goes out over serial
> and to pstore. Right?
>
> Because in such cases I want to have *all* *relevant* messages that led
> to the explosion + the explosion message itself.
>
> Makes sense? Yes, no? Aspects I've missed?

Makes sense to me.

>
> Thanks.
>


--

Cheers,
Mauro