Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF

From: Song Liu
Date: Fri Mar 12 2021 - 13:53:44 EST

> On Mar 12, 2021, at 6:24 AM, Arnaldo Carvalho de Melo <acme@xxxxxxxxxx> wrote:
>
> Em Thu, Mar 11, 2021 at 06:02:57PM -0800, Song Liu escreveu:
>> perf uses performance monitoring counters (PMCs) to monitor system
>> performance. The PMCs are limited hardware resources. For example,
>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>>
>> Modern data center systems use these PMCs in many different ways:
>> system level monitoring, (maybe nested) container level monitoring, per
>> process monitoring, profiling (in sample mode), etc. In some cases,
>> there are more active perf_events than available hardware PMCs. To allow
>> all perf_events to have a chance to run, it is necessary to do expensive
>> time multiplexing of events.
>>
>> On the other hand, many monitoring tools count the common metrics (cycles,
>> instructions). It is a waste to have multiple tools create multiple
>> perf_events of "cycles" and occupy multiple PMCs.
>>
>> bperf tries to reduce such waste by allowing multiple perf_events of
>> "cycles" or "instructions" (at different scopes) to share PMCs. Instead
>> of having each perf-stat session read its own perf_events, bperf uses
>> BPF programs to read the perf_events and aggregate the readings into BPF
>> maps. The perf-stat session(s) then read the values from these BPF maps.
>>
>> Please refer to the comment before the definition of bperf_ops for the
>> description of bperf architecture.
>>
>> bperf is off by default. To enable it, pass the --use-bpf option to
>> perf-stat. bperf uses a BPF hashmap to share information about the BPF
>> programs and maps it uses. This map is pinned to bpffs. The default path
>> is /sys/fs/bpf/bperf_attr_map. The user can change the path with the
>> --attr-map option.
>>
>> ---
>> Known limitations:
>> 1. Do not support per cgroup events;
>> 2. Do not support monitoring of BPF program (perf-stat -b);
>> 3. Do not support event groups.
>
> Cool stuff, but I think you can break this up into more self contained
> patches, see below.
>
> Apart from that, some suggestions/requests:
>
> We need a shell 'perf test' that uses some synthetic workload so that we
> can count events with/without --use-bpf (--bpf-counters is my
> alternative name, see below), and then compare if the difference is
> under some acceptable range.
>
> As a followup patch we could have something like:
>
> perf config stat.bpf-counters=yes
>
> That would make 'perf stat' use BPF counters for what it can, using the
> default method for the non-supported targets, emitting some 'perf stat
> -v' visible warning (i.e. a debug message), i.e. make it opt-in that the
> user wants to use BPF counters for all that is possible at that point in
> time.
>
> Thanks for working on this,
>
> - Arnaldo
>
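
For the 'perf test' comparison suggested above, a minimal sketch of the
"difference under some acceptable range" check, in C. The helper name and
the percentage parameter are hypothetical and not part of this patch; an
actual shell test would likely do the same arithmetic on the parsed counts.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): return true when the count
 * obtained with BPF counters differs from the baseline count by at most
 * tolerance_pct percent of the baseline.
 *
 * Integer arithmetic only: diff / base <= pct / 100 is rewritten as
 * diff * 100 <= base * pct. This assumes the counts are small enough
 * that multiplying by 100 does not overflow 64 bits.
 */
static bool counts_within_tolerance(uint64_t base_count, uint64_t bpf_count,
				    unsigned int tolerance_pct)
{
	uint64_t diff = base_count > bpf_count ? base_count - bpf_count
					       : bpf_count - base_count;

	return diff * 100 <= base_count * (uint64_t)tolerance_pct;
}
```

For example, counts_within_tolerance(1000, 1040, 5) is true and
counts_within_tolerance(1000, 1100, 5) is false. The right tolerance
depends on the workload and is left open here.
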
>> The following commands have been tested:
>>
>> perf stat --use-bpf -e cycles -a
>> perf stat --use-bpf -e cycles -C 1,3,4
>> perf stat --use-bpf -e cycles -p 123
>> perf stat --use-bpf -e cycles -t 100,101
>>
>> Signed-off-by: Song Liu <songliubraving@xxxxxx>
>> ---
>> tools/perf/Makefile.perf | 1 +
>> tools/perf/builtin-stat.c | 20 +-
>> tools/perf/util/bpf_counter.c | 552 +++++++++++++++++-
>> tools/perf/util/bpf_skel/bperf.h | 14 +
>> tools/perf/util/bpf_skel/bperf_follower.bpf.c | 65 +++
>> tools/perf/util/bpf_skel/bperf_leader.bpf.c | 46 ++
>> tools/perf/util/evsel.h | 20 +-
>> tools/perf/util/target.h | 4 +-
>> 8 files changed, 712 insertions(+), 10 deletions(-)
>> create mode 100644 tools/perf/util/bpf_skel/bperf.h
>> create mode 100644 tools/perf/util/bpf_skel/bperf_follower.bpf.c
>> create mode 100644 tools/perf/util/bpf_skel/bperf_leader.bpf.c
>>
>> diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
>> index f6e609673de2b..ca9aa08e85a1f 100644
>> --- a/tools/perf/Makefile.perf
>> +++ b/tools/perf/Makefile.perf
>> @@ -1007,6 +1007,7 @@ python-clean:
>> SKEL_OUT := $(abspath $(OUTPUT)util/bpf_skel)
>> SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
>> SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
>> +SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
>>
>> ifdef BUILD_BPF_SKEL
>> BPFTOOL := $(SKEL_TMP_OUT)/bootstrap/bpftool
>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
>> index 2e2e4a8345ea2..34df713a8eea9 100644
>> --- a/tools/perf/builtin-stat.c
>> +++ b/tools/perf/builtin-stat.c
>> @@ -792,6 +792,12 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>> }
>>
>> evlist__for_each_cpu (evsel_list, i, cpu) {
>> + /*
>> + * bperf calls evsel__open_per_cpu() in bperf__load(), so
>> + * no need to call it again here.
>> + */
>> + if (target.use_bpf)
>> + break;
>> affinity__set(&affinity, cpu);
>>
>> evlist__for_each_entry(evsel_list, counter) {
>> @@ -925,15 +931,15 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>> /*
>> * Enable counters and exec the command:
>> */
>> - t0 = rdclock();
>> - clock_gettime(CLOCK_MONOTONIC, &ref_time);
>> -
>> if (forks) {
>> evlist__start_workload(evsel_list);
>> err = enable_counters();
>> if (err)
>> return -1;
>>
>> + t0 = rdclock();
>> + clock_gettime(CLOCK_MONOTONIC, &ref_time);
>> +
>> if (interval || timeout || evlist__ctlfd_initialized(evsel_list))
>> status = dispatch_events(forks, timeout, interval, &times);
>> if (child_pid != -1) {
>> @@ -954,6 +960,10 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>> err = enable_counters();
>> if (err)
>> return -1;
>> +
>> + t0 = rdclock();
>> + clock_gettime(CLOCK_MONOTONIC, &ref_time);
>> +
>> status = dispatch_events(forks, timeout, interval, &times);
>> }
>>
>
> The above two hunks seem out of place, i.e. can they go to a different
> patch and with an explanation about why this is needed?

Actually, I am still debating whether the above change belongs in a separate
patch. It is related to the following change.

[...]

>> + /*
>> + * Attaching the skeleton takes non-trivial time (0.2s+ on a kernel
>> + * with some debug options enabled). This shows as a longer first
>> + * interval:
>> + *
>> + * # perf stat -e cycles -a -I 1000
>> + * # time counts unit events
>> + * 1.267634674 26,259,166,523 cycles
>> + * 2.271637827 22,550,822,286 cycles
>> + * 3.275406553 22,852,583,744 cycles
>> + *
>> + * Fix this by zeroing accum_readings after attaching the program.
>> + */
>> + bperf_sync_counters(evsel);
>> + entry_cnt = bpf_map__max_entries(skel->maps.accum_readings);
>> + memset(values, 0, sizeof(struct bpf_perf_event_value) * num_cpu_bpf);
>> +
>> + for (i = 0; i < entry_cnt; i++) {
>> + bpf_map_update_elem(bpf_map__fd(skel->maps.accum_readings),
>> + &i, values, BPF_ANY);
>> + }
>> + return 0;
>> +}

Attaching the skeleton takes non-trivial time, so the first interval comes
out longer than requested (about 1.27s instead of 1s in the example above).
To fix this, __run_perf_stat() now reads t0 and ref_time after
enable_counters().
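
To illustrate the ordering issue, here is a simplified model in C. It uses a
simulated clock so the result is deterministic. All fake_* names are
hypothetical stand-ins, not the real perf functions, and the model assumes
the entire excess in the first timestamp comes from the attach cost.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Simulated monotonic clock (hypothetical, for illustration only). */
struct fake_clock {
	uint64_t now_ns;
};

static uint64_t fake_rdclock(const struct fake_clock *clk)
{
	return clk->now_ns;
}

/* Stand-in for enable_counters(): attaching the skeleton costs attach_ns. */
static void fake_enable_counters(struct fake_clock *clk, uint64_t attach_ns)
{
	clk->now_ns += attach_ns;
}

/* Stand-in for waiting out one -I interval. */
static void fake_wait_interval(struct fake_clock *clk, uint64_t interval_ns)
{
	clk->now_ns += interval_ns;
}

/* Old ordering: t0 is read before the counters are enabled. */
static uint64_t first_stamp_t0_before(uint64_t attach_ns, uint64_t interval_ns)
{
	struct fake_clock clk = { .now_ns = 0 };
	uint64_t t0 = fake_rdclock(&clk);

	fake_enable_counters(&clk, attach_ns);
	fake_wait_interval(&clk, interval_ns);
	return fake_rdclock(&clk) - t0;	/* includes the attach cost */
}

/* New ordering: t0 is read after the counters are enabled. */
static uint64_t first_stamp_t0_after(uint64_t attach_ns, uint64_t interval_ns)
{
	struct fake_clock clk = { .now_ns = 0 };
	uint64_t t0;

	fake_enable_counters(&clk, attach_ns);
	t0 = fake_rdclock(&clk);
	fake_wait_interval(&clk, interval_ns);
	return fake_rdclock(&clk) - t0;	/* attach cost excluded */
}
```

With attach_ns = 267634674 and a one-second interval, the old ordering
yields 1267634674 ns for the first timestamp, while the new ordering yields
exactly NSEC_PER_SEC.
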

Maybe a comment in __run_perf_stat() is better than a separate patch?

Thanks,
Song