Re: [RFC PATCH 1/3] perf: Add persistent events

From: Robert Richter
Date: Sat Apr 06 2013 - 11:53:31 EST


Boris,

On 15.03.13 14:06:27, Borislav Petkov wrote:
> Add the needed pieces for persistent events which makes them
> process-agnostic. Also, make their buffers read-only when mmaping them
> from userspace.

> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 9fa9c622a7f4..4a4ae56195e1 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -270,8 +270,9 @@ struct perf_event_attr {
>
> exclude_callchain_kernel : 1, /* exclude kernel callchains */
> exclude_callchain_user : 1, /* exclude user callchains */
> + persistent : 1, /* always-on event */

based on the discussion we have had a couple of days ago I got the
idea that we do not necessarily need the persistent flag. The purpose
of the persistent flag is to indicate that an existing event buffer
that is already enabled should be used. This means the buffer is
shared by multiple consumers. Enabled events with the following
conditions allow this too:

* buffer mappings are readonly,
* event is systemwide,
* struct perf_event_attr is identical.

This means that an event with readonly buffers (PROT_WRITE unset,
over-write unread data) and the same struct perf_event_attr can be
simply reused. So, we don't need the flag: When setting up a
systemwide and readonly event we just check if there is already one
event running. If so we increment the use count of that event and
share its buffer, otherwise we create a new event.

This also has the advantage that the syscall i/f does not change. We
simply setup the event in the same way as we do already. Maybe the
only thing we need is a sysfs extension to report existing systemwide
events to userland.

Now, to have persistent events in the sense that the event is running
since system startup, we simply enable an in-kernel-counter. If
userland later sets up the same event, the kernel is aware of the
already running counter and redirects it to its buffer and all
(persistent) samples of the current buffer are now available to
userland.

The above also has the advantage of sharing systemwide events which
saves system resources, e.g. hardware counters. Suppose two processes
are collecting cpu-clocks, currently there are two hw counters and
buffers used for this, but only one counter would be sufficient.

I currently see the following problem with the approach. If we map to
an already running counter, there might be samples in the buffer that
had been collected before the syscall was setup. But the profiling
tool might be interested only in newer samples. So userland must be
aware of this and might be able to drop those samples. This should be
also backward compatible with the existing syscall i/f.

Another problem is different buffer size. I don't know if it is
possible to simply resize mmap'ed regions to the biggest buffer
requested. Though this problem already exist with your code.

Maybe it's worth to look at this approach. I need to review more perf
code for this.

-Robert
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/