Re: [PATCH] perf record: Add snapshot mode support for perf's regular events

From: Ingo Molnar
Date: Thu Nov 26 2015 - 04:41:11 EST



* Ingo Molnar <mingo@xxxxxxxxxx> wrote:

>
> * Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
> > > But yes, we can do that userspace ring buffer when we really need it. At
> > > first we can start working on the perf side and assume overwrite mode is
> > > ready.
> >
> > I don't think Peter asked for much: pick up the patch he has already written
> > and use it, to have an even lower overhead always-enabled background tracing
> > mode of perf.
> >
> > Resizing shouldn't be much of an issue with existing features: if events start
> > overflowing or some other threshold for dynamic increase of the ring-buffer is
> > met then the daemon should open a new set of events with a larger ring-buffer,
> > and close the old events once the new tracing ring-buffer is up and running.
> >
> > Use event multiplexing to output all interesting events into the same single
> > (per CPU) ring-buffer.
>
> Btw., there's another trick we could use to support ftrace-alike workflows even
> better: we could expose a task's active perf ring-buffers under /proc/<PID>/ and
> could make it readable.
>
> So if an overwrite-mode background tracing session is running, you don't even
> have to signal it to capture the ring-buffer: just open the ring-buffer fd in
> procfs, under /proc/XYZ/perf/ring-buffers/5.trace or so, and dump its current
> contents, assuming the task doing that has sufficient permissions - i.e.
> ptrace_may_access().
>
> We could even pretty-print some very basic version of the records from the
> kernel, via /proc/XYZ/perf/ring-buffers/5.txt, to support a tooling-less tracing
> mode. This way perf based tracing could be supported even on systems that have
> no writable filesystems.
>
> I.e. in this regard perf can be made to match ftrace's tracing workflow as well
> - in addition to the more traditional perf profiling workflow we all love and
> know!

Also note that if we go in this direction then with some additional changes we
could also support lightweight tracing with no tooling side at all on the traced
system: a simple kernel feature with a kernel thread could be added that takes a
list of events from sysfs or debugfs and opens them system-wide and exposes
per-cpu overwrite mode ring-buffers.

Those ring-buffers can then be accessed via procfs (and/or also be exposed in
parallel via debugfs). The kernel thread never actually does anything except set
up the events - i.e. this is a very lightweight mode of always-on tracing.
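
To make the overwrite-mode semantics concrete, here is a minimal user-space
sketch in C. It is not the perf ring-buffer code: it is a plain byte buffer,
whereas real perf records are variable-length and carry headers. All names
(sketch_rb, SKETCH_RB_SIZE and the helper functions) are hypothetical. The
point is only that the writer never blocks or drops data, it overwrites the
oldest bytes, and a reader can take a snapshot of the most recent contents at
any time without the writer's cooperation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical buffer size in bytes; must be a power of two. */
#define SKETCH_RB_SIZE 8

struct sketch_rb {
	uint64_t head;                      /* total bytes ever written */
	unsigned char data[SKETCH_RB_SIZE]; /* storage, indexed modulo size */
};

static void sketch_rb_init(struct sketch_rb *rb)
{
	memset(rb, 0, sizeof(*rb));
}

/* Overwrite-mode write: always succeeds, oldest bytes are lost first. */
static void sketch_rb_write(struct sketch_rb *rb, const void *buf, size_t len)
{
	const unsigned char *src = buf;
	size_t i;

	/* The index masking below only works for power-of-two sizes. */
	assert((SKETCH_RB_SIZE & (SKETCH_RB_SIZE - 1)) == 0);

	for (i = 0; i < len; i++)
		rb->data[(rb->head + i) & (SKETCH_RB_SIZE - 1)] = src[i];
	rb->head += len;
}

/*
 * Snapshot: copy the most recent bytes, oldest first, into 'out'.
 * Returns the number of bytes copied (at most SKETCH_RB_SIZE).
 */
static size_t sketch_rb_snapshot(const struct sketch_rb *rb,
				 unsigned char *out, size_t out_len)
{
	uint64_t avail = rb->head < SKETCH_RB_SIZE ? rb->head : SKETCH_RB_SIZE;
	size_t n = avail < out_len ? (size_t)avail : out_len;
	uint64_t start = rb->head - n;
	size_t i;

	for (i = 0; i < n; i++)
		out[i] = rb->data[(start + i) & (SKETCH_RB_SIZE - 1)];
	return n;
}
```

In this sketch, writing "abcdef" and then "ghijkl" into the 8-byte buffer
leaves a snapshot of "efghijkl": the four oldest bytes are gone, which is the
behaviour an always-on background tracing session wants.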

Additional debugfs toggles can be added to temporarily turn tracing on/off without
closing the events - just like ftrace.

Other toggles could be added, such as: 'stop tracing when the kernel has crashed,
or if a specific event has occurred or a condition has been met'.
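
The toggle logic described above could be modelled roughly as follows. This
is a user-space sketch with made-up names (sketch_tracer, sketch_arm_stop_on,
and so on), not an existing kernel or debugfs interface. It assumes the
triggering event itself is still recorded before tracing stops, which is one
possible policy among several.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tracer state; the events stay "open" the whole time. */
struct sketch_tracer {
	bool enabled;           /* on/off toggle, events are not closed */
	bool stop_armed;        /* is a stop-on-event trigger set? */
	uint32_t stop_event_id; /* event that freezes the trace */
	uint64_t recorded;      /* number of events recorded so far */
};

/* Temporarily turn tracing on or off without tearing anything down. */
static void sketch_set_enabled(struct sketch_tracer *t, bool on)
{
	assert(t != NULL);
	t->enabled = on;
}

/* Arm a trigger: stop tracing once event 'id' has been seen. */
static void sketch_arm_stop_on(struct sketch_tracer *t, uint32_t id)
{
	assert(t != NULL);
	t->stop_armed = true;
	t->stop_event_id = id;
}

/*
 * Called for every event. Returns true if the event was recorded.
 * The triggering event is recorded, then tracing is switched off so
 * that the ring-buffer contents leading up to it are preserved.
 */
static bool sketch_trace_event(struct sketch_tracer *t, uint32_t id)
{
	assert(t != NULL);
	if (!t->enabled)
		return false;

	t->recorded++;
	if (t->stop_armed && id == t->stop_event_id)
		t->enabled = false;
	return true;
}
```

With the trigger armed on a given event id, events before it and the
triggering event itself are recorded, and later events are ignored until
tracing is explicitly re-enabled. In overwrite mode this is what keeps the
interesting history from being overwritten after the fact.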

That way we could, among other things, capture traces on embedded systems and copy
the traces to another, larger system (or NFS-mount the target system), and run
perf tooling to analyze the traces on that more powerful system.

But it all starts with making overwrite mode work well, and working with the
kernel-visible ring-buffer. That can then be exposed to user-space in very
expressive ways to turn perf into a flexible system tracing subsystem as well.

Thanks,

Ingo