Re: [RFC] x86, mce: use of TRACE_EVENT for mce

From: Ingo Molnar
Date: Tue Oct 13 2009 - 04:45:01 EST



* Hidetoshi Seto <seto.hidetoshi@xxxxxxxxxxxxxx> wrote:

> Ingo Molnar wrote:
> > * Huang Ying <ying.huang@xxxxxxxxx> wrote:
> >
> >> I have talked with Ingo about this patch. But he has a different idea
> >> about the MCE log ring buffer and he didn't want to merge the patch even
> >> as an urgent bug fix. It seems that another re-post cannot convince
> >> him.
> >
> > Correct. The fixes are beyond what we can do in .32 - and for .33 i
> > outlined (with a patch) that we should be using not just the ftrace
> > ring-buffer (like your patch did) but perf events to expose MCE events.
> >
> > That brings MCE events to a whole new level of functionality.
> >
> > Event injection support would be an interesting new addition to
> > kernel/perf_event.c: non-MCE user-space wants to inject events as well -
> > both to simulate rare events, and to define their own user-space events.
> >
> > Is there any technical reason why we wouldn't want to take this far
> > superior approach?
> >
> > Ingo
>
> We could have a more aggressive discussion if there is a real patch.
> This is an example.

That's the right attitude :-)

I've created a new topic tree for this approach: tip:perf/mce, and i've
committed your patch with a changelog outlining the approach, and pushed
it out. Please send delta patches against latest tip:master.

I think the next step should be to determine the rough 'event structure'
we want to map out. The mce_record event you added should be split up
some more.

For example we definitely want thermal events to be separate. One
approach would be the RFC patch i sent in "[PATCH] x86: mce: New MCE
logging design" - feel free to pick that up and iterate it.

A question would be whether each MCA/MCE bank should have a separate
event enumerated. I.e. right now 'perf list' shows:

mce:mce_record [Tracepoint event]

It might make sense to do something like:

mce:mce_bank_2 [Tracepoint event]
mce:mce_bank_3 [Tracepoint event]
mce:mce_bank_5 [Tracepoint event]
mce:mce_bank_6 [Tracepoint event]
mce:mce_bank_8 [Tracepoint event]

But this is pretty static and meaningless - so what i'd like to see is
to enumerate the _logical purpose_ of the MCE events, largely driven by
the physical source of the event:

$ perf list 2>&1 | grep mce
mce:mce_cpu [Tracepoint event]
mce:mce_thermal [Tracepoint event]
mce:mce_cache [Tracepoint event]
mce:mce_memory [Tracepoint event]
mce:mce_bus [Tracepoint event]
mce:mce_device [Tracepoint event]
mce:mce_other [Tracepoint event]

etc. - with a few simple rules about what type of event goes into which
category, such as:

- CPU internal errors go into mce_cpu
- memory or L3 cache related errors go into mce_memory
- L2 and lower level cache errors go into mce_cache
- general IO / bus / interconnect errors go into mce_bus
- specific device faults go into mce_device
- the rest goes into mce_other

Note - this is just a first rough guesstimate list - more can be added
and the definition can be made stricter. (Please suggest modifications
to this categorization.) Each event still has fine-grained fields that
allow further disambiguation of precisely which event the CPU
generated.
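
As a rough illustration only, the categorization rules listed above could
be sketched as a small user-space lookup table. Everything named here
(ErrorSource, CATEGORY_BY_SOURCE, categorize) is hypothetical and not part
of any kernel or perf interface; a real implementation would live in the
kernel MCE code and work from the decoded bank status.

```python
from enum import Enum, auto


class ErrorSource(Enum):
    """Hypothetical physical error sources, named after the rules above."""
    CPU_INTERNAL = auto()
    THERMAL = auto()
    MEMORY = auto()
    L3_CACHE = auto()
    L2_CACHE = auto()
    L1_CACHE = auto()
    BUS = auto()
    DEVICE = auto()
    UNKNOWN = auto()


# Source -> proposed tracepoint name. Sources that are not listed here
# fall through to the catch-all mce_other category.
CATEGORY_BY_SOURCE = {
    ErrorSource.CPU_INTERNAL: "mce:mce_cpu",
    ErrorSource.THERMAL: "mce:mce_thermal",
    ErrorSource.MEMORY: "mce:mce_memory",
    ErrorSource.L3_CACHE: "mce:mce_memory",   # L3 is grouped with memory
    ErrorSource.L2_CACHE: "mce:mce_cache",
    ErrorSource.L1_CACHE: "mce:mce_cache",
    ErrorSource.BUS: "mce:mce_bus",
    ErrorSource.DEVICE: "mce:mce_device",
}


def categorize(source: ErrorSource) -> str:
    """Return the proposed event name for a given error source."""
    return CATEGORY_BY_SOURCE.get(source, "mce:mce_other")
```

A table keeps the rules easy to review and to tighten later, which fits
the "first rough guesstimate" status of the list.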

Note that these categories will be largely CPU independent. Certain
models will offer events in all of these categories, some models will
only provide events in a very limited subset of them.

The logical structure remains CPU model independent and tools, admins
and users can standardize on this generic 'logical overview' event
structure - instead of the current maze of model specific MCE decoding
with no real structure over it.
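
To make the "tools can standardize on this" point concrete, here is a
minimal sketch of how a user-space tool might discover which of the
logical MCE events a given machine offers. It assumes the listing has the
"name [Tracepoint event]" shape shown earlier in this mail; the function
name mce_events is made up for illustration and the real 'perf list'
output may differ between versions.

```python
import re
from typing import List

# Matches lines such as "mce:mce_cpu [Tracepoint event]", allowing for
# leading indentation and column padding between the two fields.
EVENT_LINE = re.compile(r"^\s*(mce:\w+)\s+\[Tracepoint event\]\s*$")


def mce_events(perf_list_output: str) -> List[str]:
    """Return the mce:* tracepoint names found in 'perf list'-style text."""
    events = []
    for line in perf_list_output.splitlines():
        match = EVENT_LINE.match(line)
        if match:
            events.append(match.group(1))
    return events
```

A tool written this way would not need any CPU model specific knowledge:
it only checks which of the generic categories are present.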

Once we have this higher level logging structure (while still preserving
the fine details as well), we can go a step forward and attach things
like the ability to panic the box to individual events.
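
Purely as a sketch of what "attaching" such behaviour to individual events
could look like from a policy point of view: the Action values, the
example policy and action_for are all hypothetical names chosen for
illustration, and no such interface exists today.

```python
from enum import Enum
from typing import Dict


class Action(Enum):
    """Hypothetical per-event reactions."""
    LOG = "log"
    PANIC = "panic"


# Events without an explicit entry are only logged.
DEFAULT_ACTION = Action.LOG

# Example policy: the choice of which event panics is an assumption made
# for this sketch, not a recommendation.
EXAMPLE_POLICY = {
    "mce:mce_memory": Action.PANIC,
}


def action_for(event: str, policy: Dict[str, Action]) -> Action:
    """Look up the configured reaction for one logical MCE event."""
    return policy.get(event, DEFAULT_ACTION)
```

The point is that the policy is keyed on the logical event name rather
than on model specific bank numbers.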

[ Note, we might also still keep a 'generic' event like mce_record as
well, if that still makes sense once we've split up the events
properly. ]

Then the next step would be clean and generic event injection support
that uses perf events.

Hm? Looks like pretty exciting stuff to me - there's a _lot_ of
expressive potential in the hardware, we have myriads of interesting
details that can be logged - we just need to free it up and make it
available properly.

Ingo