Re: Tracing Requirements (was: [RFC/Requirements/Design] h/w error reporting)

From: Mathieu Desnoyers
Date: Wed Nov 10 2010 - 15:23:28 EST


* Frederic Weisbecker (fweisbec@xxxxxxxxx) wrote:
> On Wed, Nov 10, 2010 at 02:00:45PM -0500, Steven Rostedt wrote:
> > On Wed, 2010-11-10 at 19:41 +0100, Ingo Molnar wrote:
> >
> > > We'll need to embark on this incremental path instead of a rewrite-the-world thing.
> > > As a maintainer my task is to say 'no' to rewrite-the-world approaches - and we can
> > > and will do better here.
> >
> > Thus you are saying that we stick to the status quo, and also ignore the
> > fact that perf was a rewrite-the-world from ftrace to begin with.
>
> Perhaps you and Mathieu can summarize your requirements here and then explain
> why extending the current ABI wouldn't work. It's quite normal that people
> try to find a solution fully backward compatible in the first place. If
> it's not possible, fine, but then justify it.

Sure, here are the requirements my user base has, followed by a list of Perf
and Ftrace pain points. Some of these derive directly from their respective
ABIs; others are caused partly by their implementation and partly by their
ABI.

- Low overhead is key
  - 150 ns per event (cache-hot)
  - Zero-copy (splice to disk/network, mmap for zero-copy in-place data
    analysis)
- Compactness of traces
  - e.g. 96 bits per event (including typical 64-bit payload), no PID saved
    per event.
- Scalability to multi-core and multi-processor
  - Per-CPU buffers, time-stamp reading both scalable to many CPUs *and*
    accurate
- Production-grade tracer reliability
  - Trace clock accuracy within 100 ns, ordering can be inferred based on
    lock/interrupt handler knowledge, ability to know when ordering might be
    wrong.
- Flight recorder mode
  - Support concurrent read while writer is overwriting buffer data
    (Thomas Gleixner named these "trace-shots")
- Support multiple trace sessions in parallel
  - Engineer + Operator + flight recorder for automated bug reports
- Availability of trace buffers for crash diagnosis
  - Save to disk, network, use kexec or persistent memory
- Heterogeneous environment support
  - Portability
  - Distinct host/target environment support
  - Management of multiple target kernel versions
  - No dependency on kernel image to analyze traces
    (traces contain complete information)
- Live view/analysis of trace streams via the network
  - Impact on buffer flushing, power saving, idle, ...
- Synchronized system-wide (hypervisor, kernel and user-space) traces
- Scalability of analysis tools to very large data sets (> 10GB)
- Standardization of trace format across analysis tools
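
To make the compactness figure concrete: 96 bits per event with a 64-bit
payload leaves 32 bits for the event header. The sketch below is only an
illustration of one way to spend those 32 bits. The 5-bit event ID / 27-bit
truncated timestamp split, the struct and the helper names are all
assumptions made for this example, not something stated in the list above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split of the 32-bit header (illustration only). */
#define EVENT_ID_BITS	5
#define TIMESTAMP_BITS	27
#define TIMESTAMP_MASK	((UINT32_C(1) << TIMESTAMP_BITS) - 1)

/*
 * One event record: 32-bit header + 64-bit payload = 96 bits.
 * The payload is stored as two 32-bit words so the compiler does not
 * insert alignment padding. No PID field: it is not saved per event.
 */
struct compact_event {
	uint32_t header;	/* event ID (high bits) | timestamp (low bits) */
	uint32_t payload[2];
};

/* Pack an event ID and the low-order bits of a timestamp into a header. */
static inline uint32_t pack_header(uint32_t event_id, uint64_t timestamp)
{
	assert(event_id < (UINT32_C(1) << EVENT_ID_BITS));
	return (event_id << TIMESTAMP_BITS) |
	       ((uint32_t)timestamp & TIMESTAMP_MASK);
}

static inline uint32_t header_event_id(uint32_t header)
{
	return header >> TIMESTAMP_BITS;
}

static inline uint32_t header_timestamp(uint32_t header)
{
	return header & TIMESTAMP_MASK;
}
```

With a truncated timestamp like this, a reader has to detect wrap-around of
the low-order bits and needs a full timestamp from somewhere else (for
instance once per sub-buffer) to rebuild absolute time. That part is left
out of the sketch.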


* Ring Buffer issues with Perf:

- Perf does not support flight recorder tracing (concurrent read/write)
  - Sub-buffers are needed to support concurrent read/writes in flight
    recorder mode. Peter still has to convince me otherwise (if he cares).
    - This implies adding padding when an event does not fit in the current
      sub-buffer (ABI change). Note for Frederic: creating a single
      sub-buffer as large as the buffer does not solve this problem, because
      perf allows writing an event across the end of the buffer and its
      beginning. In a scheme where sub-buffers can be discarded, this makes
      it quite unreliable to try to figure out where partially overwritten
      events end.
  - Calling the kernel when finishing reading a sub-buffer is needed for
    flight recorder mode tracing. It is not possible with the
    mmap-head-tail-counter ABI Perf currently uses for reader-writer
    synchronization.
- Perf is 5 times slower than Ftrace/Generic Ring Buffer Library/LTTng.
  - Partially due to implementation.
  - Partially due to large event size.
* Trace Format issues with Perf:

- Perf event headers are too large
- Handling of dynamically added instrumentation while trace is recorded is
  nonexistent.


* Ring Buffer issues with Ftrace:

- Ftrace needs an internal API cleanup.
  - "peek" is an unnecessary API duplication which complicates everything
    down to the buffer-level.
- Ftrace does not support cross-pages event writes
  - Limits event size to less than 4kB
* Trace Format issues with Ftrace:

- Ftrace timestamps are saved as delta from previous event
  - Only works for tracing where preemption can be disabled, unusable for
    user-space tracing.
  - Creates an artificial data dependency between events, leading to odd
    side-effects when dealing with nesting over tracer
    - 0 ns IRQ/SOFTIRQ handler duration side-effect
- Event size limited to one page
- Ftrace event headers are still too large
- Handling of dynamically added instrumentation while trace is recorded is
  nonexistent.
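
The delta-timestamp dependency can be shown with a small decoder sketch. It
is an illustration only: decode_deltas() is a hypothetical helper, and the
sketch assumes that events written while nesting over an in-progress write
are stored with a delta of 0, since their delta cannot be computed against a
previous event that is not yet committed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Rebuild absolute timestamps from per-event deltas. Each timestamp
 * depends on every delta before it, so one event's time cannot be
 * known without decoding all of its predecessors.
 */
static void decode_deltas(uint64_t base, const uint32_t *delta,
			  uint64_t *abs_ts, size_t n)
{
	uint64_t now = base;
	size_t i;

	assert(delta != NULL && abs_ts != NULL);
	for (i = 0; i < n; i++) {
		now += delta[i];
		abs_ts[i] = now;
	}
}
```

Feeding this the deltas { 100, 0, 0, 50 }, where the two zero entries stand
for an IRQ entry and IRQ exit recorded while nesting over the first event,
gives both IRQ events the same absolute timestamp. The handler then appears
to have run for 0 ns, whatever its real duration was.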

So given that fixing these issues requires a large ABI rework of both Ftrace and
Perf, creating a new ABI rather than building on top of an ABI not initially
designed to meet these requirements seems to really make sense here.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com