Re: [patch] Performance Counters for Linux, v2

From: Paul Mackerras
Date: Sun Dec 07 2008 - 22:25:27 EST

Ingo Molnar writes:

> There's a new "counter group record" facility that is a straightforward
> extension of the existing "irq record" notification type. This record
> type can be set on a 'master' counter, and if the master counter triggers
> an IRQ or an NMI, all the 'secondary' counters are read out atomically
> and are put into the counter-group record. The result can then be read()
> out by userspace via a single system call. (Based on extensive feedback
> from Paul Mackerras and David Miller, thanks guys!)
> The other big change is the support of virtual task counters via counter
> scheduling: a task can specify more counters than there are on the CPU,
> the kernel will then schedule the counters periodically to spread out hw
> resources.

Still not good enough, I'm sorry.

* I have no guarantee that the secondary counters were all counting
at the same time(s) as the master counter, so the numbers are
virtually useless.

* I might legitimately want to be notified based on any of the
"secondary" counters reaching particular values. The "master"
vs. "secondary" distinction is an artificial one that is going to
make certain reasonable use-cases impossible.

These things are both symptoms of the fact that you still have the
abstraction at the wrong level. The basic abstraction really needs to
be a counter-set, not an individual counter.

I think your patch can be extended to do counter-sets without
complicating the interface too much. We could have:

struct event_spec {
u32 hw_event_type;
u32 hw_event_period;
u64 hw_raw_ctrl;

int perf_counterset_open(u32 n_counters,
struct event_spec *counters,
u32 record_type,
pid_t pid,
int cpu);

and then you could have perf_counter_open as a simple wrapper around

With an approach like this we can also provide an "exclusive" mode for
the PMU (e.g. with a flag bit in record_type or n_counters), which
means that the counter-set occupies the whole PMU. That will give a
way for userspace to specify all the details of how the PMU is to be
programmed, which in turn means that the kernel doesn't need to know
all the arcane details of every event on every processor; it just
needs to know the common events.

I notice the implementation also still assumes it can add any counter
at any time subject only to a limit on the number of counters in use.
That will have to be fixed before it is usable on powerpc (and
apparently on some x86 processors too).

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at