Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF

From: Song Liu
Date: Fri Mar 12 2021 - 10:39:45 EST

> On Mar 12, 2021, at 12:36 AM, Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
>
> Hi,
>
> On Fri, Mar 12, 2021 at 11:03 AM Song Liu <songliubraving@xxxxxx> wrote:
>>
>> perf uses performance monitoring counters (PMCs) to monitor system
>> performance. The PMCs are limited hardware resources. For example,
>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>>
>> Modern data center systems use these PMCs in many different ways:
>> system level monitoring, (maybe nested) container level monitoring, per
>> process monitoring, profiling (in sample mode), etc. In some cases,
>> there are more active perf_events than available hardware PMCs. To allow
>> all perf_events to have a chance to run, it is necessary to do expensive
>> time multiplexing of events.
>>
>> On the other hand, many monitoring tools count the common metrics (cycles,
>> instructions). It is a waste to have multiple tools create multiple
>> perf_events of "cycles" and occupy multiple PMCs.
>>
>> bperf tries to reduce such waste by allowing multiple perf_events of
>> "cycles" or "instructions" (at different scopes) to share PMCs. Instead
>> of having each perf-stat session read its own perf_events, bperf uses
>> BPF programs to read the perf_events and aggregate the readings into BPF
>> maps. The perf-stat session(s) then read the values from these BPF maps.
>>
>> Please refer to the comment before the definition of bperf_ops for the
>> description of bperf architecture.
>
> Interesting! Actually I thought about something similar before,
> but my BPF knowledge is outdated. So I need to catch up, but
> I have not found the time for it so far. ;-)
>
>>
>> bperf is off by default. To enable it, pass the --use-bpf option to
>> perf-stat. bperf uses a BPF hashmap to share information about the BPF
>> programs and maps used by bperf. This map is pinned to bpffs. The default
>> path is /sys/fs/bpf/bperf_attr_map. The user can change the path with the
>> --attr-map option.
>>
>> ---
>> Known limitations:
>> 1. Do not support per cgroup events;
>> 2. Do not support monitoring of BPF program (perf-stat -b);
>> 3. Do not support event groups.
>
> In my case, per cgroup event counting is very important.
> And I'd like to do that with lots of cpus and cgroups.

We can easily extend this approach to support cgroup events. I didn't
implement it in order to keep the first version simple.

> So I'm working on an in-kernel solution (without BPF),
> I hope to share it soon.

This is interesting! I cannot wait to see what it looks like. I spent
quite some time trying to enable in-kernel sharing (not just for cgroup
events), but finally decided to try the BPF approach.

>
> And for event groups, it seems the current implementation
> cannot handle more than one event (not even in a group).
> That could be a serious limitation..

It supports multiple events. Multiple events are independent, i.e.,
"cycles" and "instructions" would use two independent leader programs.

>
>>
>> The following commands have been tested:
>>
>> perf stat --use-bpf -e cycles -a
>> perf stat --use-bpf -e cycles -C 1,3,4
>> perf stat --use-bpf -e cycles -p 123
>> perf stat --use-bpf -e cycles -t 100,101
>
> Hmm... so it loads both leader and follower programs if needed, right?
> Does it support multiple followers with different targets at the same time?

Yes, the whole idea is to have one leader program and multiple follower
programs. If we only run one of these commands at a time, it will load
one leader and one follower. If we run multiple of them in parallel,
they will share the same leader program and load multiple follower
programs.
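
To illustrate the sharing described above, here is a minimal userspace
model in Python. It is only a sketch and not the bperf code: the class
names, the (cpu, tid) filter, and the delta bookkeeping are assumptions
made for illustration. The one part taken from this thread is the shape,
one leader reading the shared counter and several followers accumulating
into their own totals.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Follower:
    """Hypothetical stand-in for one perf-stat session."""
    name: str
    wants: Callable[[int, int], bool]  # (cpu, tid) -> count this interval?
    total: int = 0


@dataclass
class Leader:
    """Hypothetical stand-in for the single shared reader of one event."""
    followers: List[Follower] = field(default_factory=list)
    last_reading: Dict[int, int] = field(default_factory=dict)

    def on_switch(self, cpu: int, tid: int, reading: int) -> None:
        # Delta since the previous reading on this CPU, attributed to the
        # task (tid) that ran during that interval.
        delta = reading - self.last_reading.get(cpu, 0)
        self.last_reading[cpu] = reading
        # Every interested follower gets the same delta; no extra counter
        # is needed per session.
        for follower in self.followers:
            if follower.wants(cpu, tid):
                follower.total += delta


def simulate() -> Dict[str, int]:
    system_wide = Follower("-a", lambda cpu, tid: True)
    some_cpus = Follower("-C 1,3,4", lambda cpu, tid: cpu in {1, 3, 4})
    one_task = Follower("-t 100", lambda cpu, tid: tid == 100)
    leader = Leader([system_wide, some_cpus, one_task])

    # Made-up samples: (cpu, tid that just ran, raw counter value on that cpu)
    samples = [(0, 100, 50), (1, 200, 30), (0, 300, 80), (1, 100, 70)]
    for cpu, tid, reading in samples:
        leader.on_switch(cpu, tid, reading)
    return {f.name: f.total for f in leader.followers}


if __name__ == "__main__":
    print(simulate())
```

In this model, adding another session only adds a Follower; the number
of counters read by the Leader stays the same.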

I tested more than the commands above. The list just means that we
support -a, -C, -p, and -t.

Currently, this works for multiple events and for different perf-stat
sessions running in parallel. The two commands below work well in parallel:

perf stat --use-bpf -e ref-cycles,instructions -a
perf stat --use-bpf -e ref-cycles,cycles -C 1,3,5

Note the use of ref-cycles, which can only use one counter on Intel CPUs.
With this approach, the above two commands will not do time multiplexing
on ref-cycles.
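
For context on why avoiding multiplexing matters: when an event is time
multiplexed, perf reports an estimate scaled by time_enabled /
time_running instead of an exact count. The Python sketch below shows
that arithmetic only. The function name and the numbers are made up for
illustration, and returning None when the event never ran is a choice of
this sketch.

```python
from typing import Optional


def scaled_count(raw: int, time_enabled: int, time_running: int) -> Optional[float]:
    """Estimate a full-interval count from a multiplexed counter.

    raw          -- value counted while the event was on a PMC
    time_enabled -- time the event was enabled
    time_running -- time the event was actually scheduled on a PMC
    """
    if time_running == 0:
        # The event never got a counter, so there is nothing to scale.
        return None
    return raw * time_enabled / time_running


# Two separate ref-cycles events competing for one counter: each runs for
# about half of the interval, so each reported value is an extrapolation.
multiplexed = scaled_count(1_000_000, time_enabled=1000, time_running=500)

# One shared event: it runs for the whole interval, so the scale factor
# is 1 and the reported value is the raw count.
shared = scaled_count(2_000_000, time_enabled=1000, time_running=1000)
```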

Thanks,
Song