Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF

From: Song Liu
Date: Sat Mar 13 2021 - 14:38:53 EST

> On Mar 12, 2021, at 6:47 PM, Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
>
> On Sat, Mar 13, 2021 at 12:38 AM Song Liu <songliubraving@xxxxxx> wrote:
>>
>>
>>
>>> On Mar 12, 2021, at 12:36 AM, Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> On Fri, Mar 12, 2021 at 11:03 AM Song Liu <songliubraving@xxxxxx> wrote:
>>>>
>>>> perf uses performance monitoring counters (PMCs) to monitor system
>>>> performance. The PMCs are limited hardware resources. For example,
>>>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>>>>
>>>> Modern data center systems use these PMCs in many different ways:
>>>> system level monitoring, (maybe nested) container level monitoring, per
>>>> process monitoring, profiling (in sample mode), etc. In some cases,
>>>> there are more active perf_events than available hardware PMCs. To allow
>>>> all perf_events to have a chance to run, it is necessary to do expensive
>>>> time multiplexing of events.
>>>>
>>>> On the other hand, many monitoring tools count the common metrics (cycles,
>>>> instructions). It is a waste to have multiple tools create multiple
>>>> perf_events of "cycles" and occupy multiple PMCs.
>>>>
>>>> bperf tries to reduce such waste by allowing multiple perf_events of
>>>> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
>>>> of having each perf-stat session read its own perf_events, bperf uses
>>>> BPF programs to read the perf_events and aggregate readings to BPF maps.
>>>> Then, the perf-stat session(s) reads the values from these BPF maps.
>>>>
>>>> Please refer to the comment before the definition of bperf_ops for the
>>>> description of bperf architecture.
>>>
>>> Interesting! Actually I thought about something similar before,
>>> but my BPF knowledge is outdated. I need to catch up, but I have
>>> not found the time for it so far. ;-)
>>>
>>>>
>>>> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
>>>> bperf uses a BPF hashmap to share information about BPF programs and maps
>>>> used by bperf. This map is pinned to bpffs. The default address is
>>>> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
>>>> --attr-map.
>>>>
>>>> ---
>>>> Known limitations:
>>>> 1. Do not support per cgroup events;
>>>> 2. Do not support monitoring of BPF program (perf-stat -b);
>>>> 3. Do not support event groups.
>>>
>>> In my case, per cgroup event counting is very important.
>>> And I'd like to do that with lots of cpus and cgroups.
>>
>> We can easily extend this approach to support cgroup events. I left that
>> out to keep the first version simple.
>
> OK.
>
>>
>>> So I'm working on an in-kernel solution (without BPF),
>>> I hope to share it soon.
>>
>> This is interesting! I cannot wait to see what it looks like. I spent
>> quite some time trying to enable in-kernel sharing (not just cgroup
>> events), but finally decided to try the BPF approach.
>
> Well I found it hard to support generic event sharing that works
> for all use cases. So I'm focusing on the per cgroup case only.
>
>>
>>>
>>> And for event groups, it seems the current implementation
>>> cannot handle more than one event (not even in a group).
>>> That could be a serious limitation..
>>
>> It supports multiple events. Multiple events are independent, i.e.,
>> "cycles" and "instructions" would use two independent leader programs.
>
> OK, then do you need multiple bperf_attr_maps? Does it work
> for an arbitrary number of events?

The bperf_attr_map (or perf_attr_map) is shared among all events, so a
single map is enough. It is a hash map with perf_event_attr as the key.
Currently, its size is hard-coded to 16 entries. We can introduce more
flexible management of this map later.
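
To illustrate the keying, here is a rough sketch. It is a toy model in
Python, not the code from the patch: Attr, AttrMap, lookup_or_insert and
the "leader-..." values are made-up names, Attr carries only two of the
many perf_event_attr fields, and raising an error on a full map is an
assumption about how the limit would show up.

```python
from dataclasses import dataclass

ATTR_MAP_SIZE = 16  # the hard-coded capacity mentioned above


@dataclass(frozen=True)
class Attr:
    """Hypothetical stand-in for perf_event_attr (only two fields shown)."""
    type: int
    config: int


class AttrMap:
    """Toy model of a fixed-capacity hash map keyed by the event attr."""

    def __init__(self, max_entries=ATTR_MAP_SIZE):
        self.max_entries = max_entries
        self.entries = {}

    def lookup_or_insert(self, attr, new_entry):
        # An existing entry means some session already set up this event,
        # so the caller reuses that entry instead of creating its own.
        if attr in self.entries:
            return self.entries[attr]
        # A new key is rejected once the map is full (assumed behavior).
        if len(self.entries) >= self.max_entries:
            raise OverflowError("attr map is full")
        self.entries[attr] = new_entry
        return new_entry


attr_map = AttrMap()
cycles = Attr(type=0, config=0)        # PERF_TYPE_HARDWARE, cycles
instructions = Attr(type=0, config=1)  # PERF_TYPE_HARDWARE, instructions

first = attr_map.lookup_or_insert(cycles, "leader-cycles")
second = attr_map.lookup_or_insert(cycles, "leader-unused")
other = attr_map.lookup_or_insert(instructions, "leader-instructions")
print(first, second, other)
```

Two sessions asking for "cycles" end up with the same entry, while
"instructions" gets its own entry in the same map.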

Thanks,
Song