Re: perf, ftrace and MCEs

From: Borislav Petkov
Date: Sat May 15 2010 - 09:43:36 EST


From: Ingo Molnar <mingo@xxxxxxx>
Date: Tue, May 04, 2010 at 01:32:27PM +0200

Hi,

> To start with this, a quick initial prototype could use the 'perf trace' live
> mode tracing script. (See latest -tip, 'perf trace --script <script-name>' and
> 'perf record -o -' to activate live mode.)

so I did some experimenting with this and have a pretty rough prototype
which conveys decoded MCEs to userspace where they're read with perf.
More specifically, I did

perf record -e mce:mce_record -a

after tweaking the mce_record tracepoint to include the decoded error
string.

And then doing

perf trace -g python
perf trace -s perf-trace.py

got me:

in trace_begin
mce__mce_record 6 00600.700632283 0 init mcgcap=262, mcgstatus=0, bank=4, status=15888347641659525651, addr=26682366720, misc=13837309867997528064, ip=0, cs=0, tsc=0, walltime=1273928155, cpu=6, cpuid=1052561, apicid=6, socketid=0, cpuvendor=2, decoded_err= Northbridge Error, node 1ECC/ChipKill ECC error.
CE err addr: 0x636649b00
CE page 0x636649, offset 0xb00, grain 0, syndrome 0x1fd, row 3, channel 0
Transaction type: generic read(mem access), no t
in trace_end

which shows the signature of an ECC error which I injected earlier over
the EDAC sysfs interface. And yes, decoded_err appears truncated, so
I'll have to think of a slicker way to collect that info.
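
As a quick sanity check on the output above, the raw decimal fields can
be re-rendered in hex from the script side. A minimal sketch, assuming
4K pages; the helper names are made up for illustration and are not part
of perf or of the attached patches:

```python
# Hypothetical helpers, not part of perf or the patches in this thread.
PAGE_SHIFT = 12  # assumption: 4K pages


def split_addr(addr):
    """Split a physical address into (page frame number, offset in page)."""
    return addr >> PAGE_SHIFT, addr & ((1 << PAGE_SHIFT) - 1)


def fmt_hex_fields(**fields):
    """Render integer fields as hex, in the order they were passed."""
    return ", ".join("%s=%#x" % (k, v) for k, v in fields.items())


# Value copied from the addr= field of the mce_record line above.
addr = 26682366720
page, offset = split_addr(addr)
print(fmt_hex_fields(addr=addr, page=page, offset=offset))
# prints: addr=0x636649b00, page=0x636649, offset=0xb00
```

This agrees with the "CE err addr" and "CE page" lines in the decoded
string, so at least the address makes it through the tracepoint intact.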

Although they're still pretty rough, I've attached the relevant patches
so that one can get an impression of where we're heading here.

0001-amd64_edac-Remove-polling-mechanism.patch removes the EDAC
polling mechanism in favor of hooking into the machine_check_poll
polling function using the atomic notifier which we already use for
uncorrectable errors.

The other two

0002-mce-trace-Add-decoded-string-to-mce_record-s-format.patch
0003-edac-mce-Prepare-error-decoded-info.patch

add that decoded_err string. I'm open to better ideas here though.

Concerning the early MCE logging and reporting, I'm thinking of using
the mce.c ring buffer temporarily until the ftrace buffer has been
initialized and then copying all records into the latter. We might do a
more elegant solution in the future after all that bootmem churn has
quieted down and allocate memory early for a dedicated MCE ring buffer
or whatever.
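
The staging idea can be sketched in a few lines. This is a userspace toy
in Python, not kernel code: the class and method names are made up, and
the real mce.c buffer and the ftrace ring buffer have different
interfaces. Dropping the oldest record on overflow is an arbitrary
choice made for the toy only.

```python
from collections import deque


class EarlyLog:
    """Toy model: hold records in a small fixed-size buffer until the
    main buffer exists, then move them over in arrival order."""

    def __init__(self, early_capacity):
        # Bounded staging buffer; the oldest record is dropped on overflow.
        self.early = deque(maxlen=early_capacity)
        self.main = None  # main buffer not initialized yet

    def log(self, record):
        if self.main is None:
            self.early.append(record)
        else:
            self.main.append(record)

    def main_ready(self):
        """Main buffer is up: copy staged records over, preserving order."""
        self.main = list(self.early)
        self.early.clear()


log = EarlyLog(early_capacity=4)
log.log("mce0")
log.log("mce1")
log.main_ready()
log.log("mce2")
print(log.main)  # prints: ['mce0', 'mce1', 'mce2']
```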

Wrt critical MCEs, I'm leaning towards bypassing the perf/ftrace
subsystem altogether in favor of executing the smallest amount of code
possible: for example, switching to a tty, dumping the decoded error and
in certain cases not panicking but shutting down gracefully after a
timeout. Of course, graceful shutdown is completely dependent on the
type of hw failure and in some cases we can't do anything else but
freeze in order to prevent faulty data propagation.

I'm sure there's more...

Thanks.

--
Regards/Gruss,
Boris.