[RFD] Future tracing/instrumentation directions

From: Ingo Molnar
Date: Thu May 20 2010 - 05:32:29 EST

* Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:

> More than a year and a half ago (September 2008), at
> Linux Plumbers, we had a meeting with several kernel
> developers to come up with a unified ring buffer. A
> generic ring buffer in the kernel that any subsystem
> could use. After coming up with a set of requirements, I
> worked on implementing it. One of the requirements was
> to start off simple and work to become a more complete
> buffering system.
> [...]

The thing is, in tracing land and more broadly in
instrumentation land we have _much_ more earthly problems
these days:

- Let's face it, performance of the ring-buffer sucks, in
a big way. I've recently benchmarked it and it takes
hundreds of instructions to trace a single event.
Puh-lease ...

- It has grown a lot of slack. It's complex and hard to

- Over the last year or so the majority of bleeding-edge
tracing developers have gradually migrated over to perf
and 'perf trace' / 'perf probe' in particular. As far
as the past two merge windows go they are
out-developing the old ftrace UIs by a ratio of 4:1.

So this angle is becoming a prime thing to improve and
users and developers are hurting from the ftrace/perf

- [ While it's still a long way off, if this trend continues
we eventually might even be able to get rid of the
/debug/tracing/ temporary debug API and get rid of
the ugly in-kernel pretty-printing bits. This is
good: it may make Andrew very happy for a change ;-)

The main detail here to be careful of is that lots of
people are fond of the simplicity of the
/debug/tracing/ debug UI, so when we replace it we
want to do it by keeping that simple workflow (or
best by making it even simpler). I have a few ideas
how to do this.

There's also the detail that in some cases we want to
print events in the kernel in a human readable way:
for example EDAC/MCE and other critical events,
trace-on-oops, etc. This too can be solved. ]

Regarding performance and complexity, which is our main
worry atm, fortunately there's work going on in that
direction - please see PeterZ's recent string of patches
on lkml:

4f41c01: perf/ftrace: Optimize perf/tracepoint interaction for single events
a19d35c: perf: Optimize buffer placement by allocating buffers NUMA aware
ef60777: perf: Optimize the perf_output() path by removing IRQ-disables
fa58815: perf: Optimize the hotpath by converting the perf output buffer to local_t
6d1acfd: perf: Optimize perf_output_*() by avoiding local_xchg()

And it may sound harsh but at this stage i'm personally
not at all interested in big design talk. This isn't rocket
science, we have developers and users and we know what
they are doing and we know what we need to do: we need to
improve our crap and we need to reduce complexity. Less is
more.

So i'd like to see iterative, useful action first, and i
am somewhat sceptical about yet another grand tracing
design trying to match 100 requirements.

Steve, Mathieu, if you are interested please help out
Peter with the performance and simplification work. The
last thing we need is yet another replace-everything

If we really want to create a new ring-buffer abstraction
i'd suggest we start with Peter's, it has a quite sane
design and stayed simple and flexible - if then it could
be factored out a bit.
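
To make "simple and flexible" a bit more concrete, here is a minimal
user-space sketch of the kind of small ring-buffer core such a
factoring-out could aim for. This is NOT Peter's code and not the
kernel/perf_event.c implementation: the names (struct rb, rb_write(),
rb_read(), RB_SIZE) are hypothetical, and it assumes a single producer
and a single consumer with no concurrency handling at all. It only
illustrates the shape: free-running head/tail counters over a
power-of-two buffer, and a writer that drops a record rather than
blocking when there is no room.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical buffer size; must be a power of two so that
 * masking with (RB_SIZE - 1) wraps an index. */
#define RB_SIZE 64u

struct rb {
	uint8_t data[RB_SIZE];
	size_t head;	/* total bytes ever written (free-running) */
	size_t tail;	/* total bytes ever consumed (free-running) */
};

static void rb_init(struct rb *rb)
{
	rb->head = 0;
	rb->tail = 0;
}

/* Bytes currently stored and not yet consumed. */
static size_t rb_used(const struct rb *rb)
{
	return rb->head - rb->tail;
}

/* Append one record. Returns 0 on success, or -1 if the record
 * does not fit, in which case nothing is written (it is dropped). */
static int rb_write(struct rb *rb, const void *buf, size_t len)
{
	const uint8_t *src = buf;
	size_t i;

	if (len > RB_SIZE - rb_used(rb))
		return -1;

	for (i = 0; i < len; i++)
		rb->data[(rb->head + i) & (RB_SIZE - 1)] = src[i];
	rb->head += len;
	return 0;
}

/* Consume up to len bytes into buf. Returns the number of bytes copied. */
static size_t rb_read(struct rb *rb, void *buf, size_t len)
{
	uint8_t *dst = buf;
	size_t n = rb_used(rb);
	size_t i;

	if (len < n)
		n = len;

	for (i = 0; i < n; i++)
		dst[i] = rb->data[(rb->tail + i) & (RB_SIZE - 1)];
	rb->tail += n;
	return n;
}
```

A real in-kernel version would of course need per-CPU buffers, proper
ordering against the reader and NMI-safe updates, which is exactly
where the complexity tends to creep in.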

Here are more bits of what i see as the 'action' going
forward, in no particular order:

1) Push the /debug/tracing/events/ event description
into sysfs, as per this thread on lkml:

[RFC][PATCH v2 06/11] perf: core, export pmus via sysfs


I.e. one more step towards integrating ftrace into perf.

2) Use 1) to unify the perf events and the ftrace
ring-buffer. This, as things stand, is
best done by factoring out Peter's ring-buffer
in kernel/perf_event.c. It's properly abstracted
and it's _far_ simpler than kernel/tracing/ring_buffer.c,
which has become a monstrosity.

(but i'm open to other simplifications as well)

3) Add the function-tracer and function-graph tracer
as an event and integrate it into perf.

This will live-test the efficiency of the unification
and bring over the last big ftrace plugin to perf.

4) Gradually convert/port/migrate all the remaining
plugins over as well. We need to do this very gently
because there are users - but stop piling new
functionality on to the old ftrace side. This usually

- Conversion of an explicit tracing callback to
TRACE_EVENT (for example in the case of mmiotrace),
while keeping all tool functionality.

- Migrate any 'special' ftrace feature to perf
capabilities so that it's available via the
syscall interface as well. (for example
'latency maximum tracking' is something that we
probably want to do with kernel-side help - we
probably don't want to implement it via tracing
everything all the time and finding the maximum
based on terabytes of data.)

(And there are other complications here too, but you
get the idea.)
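
To illustrate the 'latency maximum tracking' point from 4): the
difference between kernel-side help and post-processing a full trace
is basically the difference between keeping one running maximum at
event time and logging every sample for later. A minimal user-space
sketch of the former follows. The names (struct max_latency,
max_latency_update()) are hypothetical and not an existing kernel or
perf interface, and timestamps are assumed to be monotonic nanosecond
values supplied by the caller.

```c
#include <stdint.h>

/* Hypothetical tracker state: constant size, no matter how many
 * events are observed. */
struct max_latency {
	uint64_t max_ns;	/* largest latency seen so far */
	uint64_t count;		/* number of samples folded in */
};

static void max_latency_init(struct max_latency *m)
{
	m->max_ns = 0;
	m->count = 0;
}

/* Fold one measured interval into the tracker.
 * Returns 1 if this sample set a new maximum, 0 otherwise.
 * A new maximum is the only case where a tool would need to be
 * told anything (e.g. to snapshot the trace). */
static int max_latency_update(struct max_latency *m,
			      uint64_t start_ns, uint64_t end_ns)
{
	uint64_t delta = end_ns - start_ns;

	m->count++;
	if (delta > m->max_ns) {
		m->max_ns = delta;
		return 1;
	}
	return 0;
}
```

The point being that the state stays a couple of words in size while
the trace-everything approach grows with the number of events.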

All in all, i think we can reuse more than 50% of all
current ftrace code (possibly up to 70-80%) - and we are
already reusing bits of it.

