Re: [PATCH 2/2] x86, mce: Add persistent MCE event

From: Ingo Molnar
Date: Fri May 18 2012 - 04:18:44 EST



* Borislav Petkov <bp@xxxxxxxxx> wrote:

> On Sat, Mar 24, 2012 at 10:15:01AM +0100, Ingo Molnar wrote:
> > * Borislav Petkov <bp@xxxxxxxxx> wrote:
> >
> > > On Sat, Mar 24, 2012 at 08:37:31AM +0100, Ingo Molnar wrote:
> > > > I was mainly thinking of reducing this:
> > > >
> > > > arch/x86/kernel/cpu/mcheck/mce.c | 53 ++++++++++++++++++++++++++++++++++++++
> > > > 1 file changed, 53 insertions(+)
> > > >
> > > > to almost nothing. There doesn't seem to be much MCE specific in
> > > > that code, right?
> > >
> > > Yeah, this could be generalized even more, AFAICT.
> > >
> > > >
> > > > > Btw, the more important question is are we going to need
> > > > > persistent events that much so that a generic approach is
> > > > > warranted? I guess maybe the black box events recording deal
> > > > > would be another user..
> > > >
> > > > So, here's the big picture as I see it:
> > > >
> > > > I think tracing could use persistent events: mark all the events
> > > > we want to trace as persistent from bootup, and recover the
> > > > bootup trace after the system has been booted up.
> > >
> > > Right, but (more nasty questions):
> > >
> > > Why would I do this, am I tracing the boot process? [...]
> >
> > Correct, in essence the MCE persistent event is partially about
> > that: we are starting to collect events well before there's any
> > user-space available.
> >
> > > [...] If so, then I need another syntax which enables those
> > > events from the kernel command line which gets parsed the
> > > moment ftrace and ring buffer get initialized.
> >
> > Correct. Something really simple like:
> >
> > boot_trace=<event1>,<event2>...
> >
> > ... which could be all implicit within MCE too. (So I'm not
> > suggesting some boot command trigger to provide the MCE case -
> > but for more general boot tracing it would be the right
> > solution.)
> >
> > > IOW, I'd need userspace for perf otherwise but I don't have
> > > that before booting...
> >
> > Correct. In the case of MCE there's no "userspace" really needed
> > - we just want to trace early enough. This model carries over to
> > later as well: there's no *specific* process we want to attach
> > the trace buffer to - we just want a persistent trace buffer
> > that essentially never loses MCE events.
> >
> > > Then, after having booted, do I stop the trace? If no, then I
> > > can see the persistency in there so are you saying we want a
> > > low overhead, low resource utilization machinery which runs
> > > all the time and traces the system? What are possible real
> > > life use cases for that? Scheduler analysis probably,
> > > long-term tracing of some stuff people are interested in how
> > > it behaves over long periods of time... MCE is one use case,
> > > definitely...
> >
> > Boot tracing is a very real use case, people use it to reduce
> > boot times. Today printk timestamps are used as a substitute.
> > (There's also a boot tracer plugin within ftrace, see the
> > bootup_tracer.)
> >
> > > > But other, runtime models of tracing could use it as well:
> > > > basically the main difference that ftrace has to perf based
> > > > tracing today is a system-wide persistent buffer with no
> > > > particular owning process. (The rest is mostly UI and
> > > > analysis features and scope of tracing differences, and of
> > > > course a lot more love and detail went into ftrace so far.)
> > > >
> > > > So MCE will in the end be just a minor user of such a
> > > > facility - I think you should aim for enabling *any* set of
> > > > events to have persistent recording properties, and add the
> > > > APIs to recover that information sanely. It should also be
> > > > possible for them to record into a shared mmap page in
> > > > essence - instead of having per event persistent buffers.
> > >
> > > Sounds like ftrace. But we have that already, we only need to
> > > get to using it perf-side, no...? [...]
> >
> > What we want is to extend the perf ring-buffer to be persistent
> > *as well*. It's an evidently useful model of collecting events.
> >
> > All the remaining perf tooling can be used after that point - if
> > it's a bog-standard perf ring-buffer then it can be saved into a
> > perf.data and can be analyzed in a rich fashion, etc.
> >
> > Think about it: for example we could do not just boot tracing
> > but also boot *profiling*, by using the PMU to sample into a
> > persistent buffer which after bootup can be put into a perf.data
> > and 'perf report' will do the right thing, etc...
> >
> > Does it overlap with ftrace? Perf overlapped with ftrace from
> > day one on and it's starting to become a maintenance problem: we
> > want to remove that overlap not by keeping two separate entities
> > (both of which suck and rule in their own ways) but having a
> > unified facility.
>
> Leaving all of the above for reference.
>
> So, I spent some more nights sleeping on it :-)
>
> Here's what I dreamt of:
>
> * The last thing perf_event_init() does is init the persistent, per-cpu
> buffers.
>
> * there's no need for changing TRACE_EVENT: "boot_trace" parameter
> parsing code enables those events the moment perf is initialized. We're
> doing this anyway because we're enabling the trace_mce_record TP.
>
> It sounds pretty simple to me but the devil is in the details,
> especially making the persistent buffers task-agnostic and generic
> enough.
>
> Ingo, Peter, thoughts?

Sounds good to me in principle - I guess if you send something
that is tested, works, and also enables boot tracing we can see
the details?
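
To make the boot_trace=<event1>,<event2>... syntax quoted above concrete,
here is a minimal user-space sketch of splitting such a comma-separated
list into event names. It is only an illustration under stated
assumptions: parse_boot_trace(), boot_events[] and the two size limits
are hypothetical names invented for this example, not existing kernel
symbols. An in-kernel version would presumably be registered as an early
parameter handler and would enable each named event once perf is
initialized, which is not shown here.

```c
#include <string.h>

#define MAX_BOOT_EVENTS 16	/* hypothetical limit on listed events */
#define MAX_EVENT_NAME  64	/* hypothetical limit on one event name */

/* Hypothetical table of the events requested via boot_trace=. */
static char boot_events[MAX_BOOT_EVENTS][MAX_EVENT_NAME];
static int nr_boot_events;

/*
 * Split the value of "boot_trace=<event1>,<event2>..." into names.
 * Empty entries (as in "a,,b") are skipped. Returns the number of
 * events recorded, or -1 if there are too many events or a name is
 * too long to store.
 */
static int parse_boot_trace(const char *arg)
{
	const char *p = arg;

	nr_boot_events = 0;
	while (*p) {
		/* Length of the current token, up to the next ',' or end. */
		size_t len = strcspn(p, ",");

		if (len > 0) {
			if (nr_boot_events == MAX_BOOT_EVENTS ||
			    len >= MAX_EVENT_NAME)
				return -1;
			memcpy(boot_events[nr_boot_events], p, len);
			boot_events[nr_boot_events][len] = '\0';
			nr_boot_events++;
		}
		p += len;
		if (*p == ',')
			p++;	/* step over the separator */
	}
	return nr_boot_events;
}
```

Calling parse_boot_trace("mce:mce_record,sched:sched_switch") fills
boot_events[] with the two names in order. The input string is left
untouched, so the caller can keep it const.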

Thanks,

Ingo