Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF

From: Song Liu
Date: Mon Mar 15 2021 - 03:52:10 EST

> On Mar 14, 2021, at 3:48 PM, Jiri Olsa <jolsa@xxxxxxxxxx> wrote:
>
> On Thu, Mar 11, 2021 at 06:02:57PM -0800, Song Liu wrote:
>> perf uses performance monitoring counters (PMCs) to monitor system
>> performance. The PMCs are limited hardware resources. For example,
>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>>
>> Modern data center systems use these PMCs in many different ways:
>> system level monitoring, (maybe nested) container level monitoring, per
>> process monitoring, profiling (in sample mode), etc. In some cases,
>> there are more active perf_events than available hardware PMCs. To allow
>> all perf_events to have a chance to run, it is necessary to do expensive
>> time multiplexing of events.
>>
>> On the other hand, many monitoring tools count the common metrics (cycles,
>> instructions). It is a waste to have multiple tools create multiple
>> perf_events of "cycles" and occupy multiple PMCs.
>>
>> bperf tries to reduce such waste by allowing multiple perf_events of
>> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
>> of having each perf-stat session read its own perf_events, bperf uses
>> BPF programs to read the perf_events and aggregate readings to BPF maps.
>> Then, the perf-stat session(s) reads the values from these BPF maps.
>>
>> Please refer to the comment before the definition of bperf_ops for the
>> description of bperf architecture.
>>
>> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
>> bperf uses a BPF hashmap to share information about BPF programs and maps
>> used by bperf. This map is pinned to bpffs. The default path is
>> /sys/fs/bpf/bperf_attr_map. The user can change the path with the
>> --attr-map option.
>>
>> ---
>> Known limitations:
>> 1. Do not support per cgroup events;
>
> isn't that another filter via bpf_get_current_cgroup_id ?

This is tricky with nested cgroups. We also have bpf_get_current_ancestor_cgroup_id,
but that is not exactly what we need either. We may need a new helper.

Also, we are limited to 38 follower programs per leader program, which
might be too few for Namhyung's use case. We could add some logic to the
follower program to count events for many cgroups in one fexit program.
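
Roughly the kind of follower-side logic I have in mind (untested sketch:
the map names, the attach point, and the way "delta" is obtained are all
made up here, and the fixed ancestor-level walk only approximates the
nesting problem above):

/* Untested sketch: credit one delta to per-cgroup slots. */
#include <linux/bpf.h>
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

#define MAX_CGROUPS     32
#define MAX_LEVELS      4

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, MAX_CGROUPS);
        __type(key, __u64);             /* cgroup id */
        __type(value, __u32);           /* slot in accum_readings */
} cgroup_idx SEC(".maps");

struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, MAX_CGROUPS);
        __type(key, __u32);
        __type(value, __u64);
} accum_readings SEC(".maps");

SEC("raw_tp/sched_switch")
int count_per_cgroup(void *ctx)
{
        __u64 delta = 1;        /* stands in for the real diff reading */
        int level;

        /* Credit every matching ancestor so nested cgroups are counted. */
        for (level = 0; level < MAX_LEVELS; level++) {
                __u64 cgid = bpf_get_current_ancestor_cgroup_id(level);
                __u32 *slot = bpf_map_lookup_elem(&cgroup_idx, &cgid);
                __u64 *val;

                if (!slot)
                        continue;
                val = bpf_map_lookup_elem(&accum_readings, slot);
                if (val)
                        *val += delta;
        }
        return 0;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";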

>
>> 2. Do not support monitoring of BPF program (perf-stat -b);
>
> we'd need to call the leader on fentry/fexit of the monitored bpf
> program, right?

My current plan is to let perf-stat -b share the same perf_event map with
bperf, but not the leader program.

>
>> 3. Do not support event groups.
>
> I guess for group support you'll need to load 'leaders' for each group member

I am not sure how this would work. Say the user started the following in
parallel:

#1 perf stat --bpf-counters -e cycles -a &
#2 perf stat --bpf-counters -e instructions -C 1,2,3 &
#3 perf stat --bpf-counters -e {cycles,instructions} -p 123

Event "cycles" is loaded by #1, while event "instructions" is loaded by #2.
If #3 reuses these events, it is tricky (or impossible?) to make sure the
event group works as expected, right?

>
>>
>> The following commands have been tested:
>>
>> perf stat --use-bpf -e cycles -a
>> perf stat --use-bpf -e cycles -C 1,3,4
>> perf stat --use-bpf -e cycles -p 123
>> perf stat --use-bpf -e cycles -t 100,101
>
> works good with that file you sent.. I'll check/test more,
> so far some quick comments below
>
> thanks,
> jirka
>
>
>
> SNIP
>
>> @@ -1146,6 +1156,10 @@ static struct option stat_options[] = {
>> #ifdef HAVE_BPF_SKEL
>> OPT_STRING('b', "bpf-prog", &target.bpf_str, "bpf-prog-id",
>> "stat events on existing bpf program id"),
>> + OPT_BOOLEAN(0, "use-bpf", &target.use_bpf,
>> + "use bpf program to count events"),
>> + OPT_STRING(0, "attr-map", &target.attr_map, "attr-map-path",
>> + "path to perf_event_attr map"),
>
> what's the point of allowing another name? just debug purpose?

It is mostly to cover corner cases, like when something else already uses
the same name.

>
> SNIP
>
>> + * bperf uses a hashmap, the attr_map, to track all the leader programs.
>> + * The hashmap is pinned in bpffs. flock() on this file is used to ensure
>> + * no concurrent access to the attr_map. The key of attr_map is struct
>> + * perf_event_attr, and the value is struct bperf_attr_map_entry.
>> + *
>> + * struct bperf_attr_map_entry contains two __u32 IDs, bpf_link of the
>> + * leader prog, and the diff_map. Each perf-stat session holds a reference
>> + * to the bpf_link to make sure the leader prog is attached to sched_switch
>> + * tracepoint.
>> + *
>> + * Since the hashmap only contains IDs of the bpf_link and diff_map, it
>> + * does not hold any references to the leader program. Once all perf-stat
>> + * sessions of these events exit, the leader prog, its maps, and the
>> + * perf_events will be freed.
>> + */
>> +struct bperf_attr_map_entry {
>> + __u32 link_id;
>> + __u32 diff_map_id;
>> +};
>> +
>> +#define DEFAULT_ATTR_MAP_PATH "/sys/fs/bpf/bperf_attr_map"
>
> we should make that with sysfs__mountpoint

Sounds great!
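
Something like the following, I guess (just a sketch; the helper name
bperf_attr_map_path() is made up, not from this patch):

/* Sketch: derive the default attr_map path from sysfs__mountpoint()
 * instead of hardcoding /sys/fs/bpf. Helper name is made up. */
#include <stdio.h>
#include <api/fs/fs.h>          /* sysfs__mountpoint() */

#define BPERF_DEFAULT_ATTR_MAP_NAME "bperf_attr_map"

static int bperf_attr_map_path(char *buf, size_t size)
{
        const char *sysfs = sysfs__mountpoint();

        if (!sysfs)
                return -1;

        /* bpffs is conventionally mounted at <sysfs>/fs/bpf */
        snprintf(buf, size, "%s/fs/bpf/%s", sysfs,
                 BPERF_DEFAULT_ATTR_MAP_NAME);
        return 0;
}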

>
> SNIP
>
>> +static int bperf_reload_leader_program(struct evsel *evsel, int attr_map_fd,
>> + struct bperf_attr_map_entry *entry)
>> +{
>> + struct bperf_leader_bpf *skel = NULL;
>> + struct perf_cpu_map *all_cpus;
>> + int link_fd, diff_map_fd, err;
>> + struct bpf_link *link = NULL;
>> +
>> + skel = bperf_leader_bpf__open();
>> + if (!skel) {
>> + pr_err("Failed to open leader skeleton\n");
>> + return -1;
>> + }
>> +
>> + bpf_map__resize(skel->maps.events, libbpf_num_possible_cpus());
>> + err = bperf_leader_bpf__load(skel);
>> + if (err) {
>> + pr_err("Failed to load leader skeleton\n");
>> + goto out;
>> + }
>> +
>> + err = -1;
>> + link = bpf_program__attach(skel->progs.on_switch);
>> + if (!link) {
>> + pr_err("Failed to attach leader program\n");
>> + goto out;
>> + }
>> +
>> + all_cpus = perf_cpu_map__new(NULL);
>> + if (!all_cpus) {
>> + pr_err("Failed to open cpu_map for all cpus\n");
>> + goto out;
>> + }
>> +
>> + link_fd = bpf_link__fd(link);
>> + diff_map_fd = bpf_map__fd(skel->maps.diff_readings);
>> + entry->link_id = bpf_link_get_id(link_fd);
>> + entry->diff_map_id = bpf_map_get_id(diff_map_fd);
>> + err = bpf_map_update_elem(attr_map_fd, &evsel->core.attr, entry, BPF_ANY);
>> + assert(err == 0);
>> +
>> + evsel->bperf_leader_link_fd = bpf_link_get_fd_by_id(entry->link_id);
>> + assert(evsel->bperf_leader_link_fd >= 0);
>> +
>> + /* save leader_skel for install_pe */
>
> please add to the comment: ', which is called within following evsel__open_per_cpu call'
> or something like that..

Will add it in v2.

>
> SNIP
>
>> +static int bperf_sync_counters(struct evsel *evsel)
>> +{
>> + struct perf_cpu_map *all_cpus = perf_cpu_map__new(NULL);
>> + int num_cpu, i, cpu;
>> +
>> + if (!all_cpus)
>> + return -1;
>> +
>> + num_cpu = all_cpus->nr;
>> + for (i = 0; i < num_cpu; i++) {
>> + cpu = all_cpus->map[i];
>> + bperf_trigger_reading(evsel->bperf_leader_prog_fd, cpu);
>> + }
>> + return 0;
>> +}
>> +
>> +static int bperf__enable(struct evsel *evsel)
>> +{
>> + struct bperf_follower_bpf *skel = evsel->follower_skel;
>> + __u32 num_cpu_bpf = libbpf_num_possible_cpus();
>
> we have cpu__max_cpu for that

libbpf uses libbpf_num_possible_cpus() for percpu array maps, so I guess it
is better to use the same here. The two are identical at the moment, though.
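
For reference, the reader side looks roughly like this (sketch only;
read_accum() is a made-up name): a lookup on a BPF_MAP_TYPE_PERCPU_ARRAY
copies out one value per possible CPU, so the buffer has to be sized with
libbpf_num_possible_cpus().

/* Sketch: a lookup on a percpu map returns one value per possible CPU,
 * so size the buffer with libbpf_num_possible_cpus(). read_accum() is a
 * made-up helper; error handling is omitted. */
#include <linux/bpf.h>          /* struct bpf_perf_event_value */
#include <bpf/bpf.h>            /* bpf_map_lookup_elem() */
#include <bpf/libbpf.h>         /* libbpf_num_possible_cpus() */

static int read_accum(int accum_fd, __u32 key)
{
        int num_cpu_bpf = libbpf_num_possible_cpus();
        struct bpf_perf_event_value values[num_cpu_bpf];

        /* the kernel writes one entry per possible CPU into values[] */
        return bpf_map_lookup_elem(accum_fd, &key, values);
}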

>
>> + struct bpf_perf_event_value values[num_cpu_bpf];
>> + int err, i, entry_cnt;
>> +
>> + err = bperf_follower_bpf__attach(evsel->follower_skel);
>> + if (err)
>> + return -1;
>
> could we attach it in the load callback and have some map
> value checked in the follower program, so that 'if value != 0'
> lets the program continue.. I used/saw this
> in a few bcc programs

Yeah, that's another way to eliminate the __attach() latency. Let me try it.
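
Something like the pattern below, I assume (untested sketch; the map name
and the attach point are illustrative, and the real follower is an fexit
program on the leader rather than a tracepoint). The user side would then
flip the flag with bpf_map_update_elem() in bperf__enable() instead of
calling __attach() there.

/* Untested sketch of gating an already-attached program on an "enabled"
 * flag. Map name and attach point are illustrative only. */
#include <linux/bpf.h>
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u32);
} enabled SEC(".maps");

SEC("raw_tp/sched_switch")
int follower_gate(void *ctx)
{
        __u32 zero = 0;
        __u32 *on = bpf_map_lookup_elem(&enabled, &zero);

        if (!on || !*on)
                return 0;       /* attached, but not counting yet */

        /* ... accumulate readings as before ... */
        return 0;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";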

>
>> +
>> + /*
>> + * Attaching the skeleton takes non-trivial time (0.2s+ on a kernel
>> + * with some debug options enabled). This shows as a longer first
>> + * interval:
>> + *
>> + * # perf stat -e cycles -a -I 1000
>> + * # time counts unit events
>> + * 1.267634674 26,259,166,523 cycles
>> + * 2.271637827 22,550,822,286 cycles
>> + * 3.275406553 22,852,583,744 cycles
>> + *
>> + * Fix this by zeroing accum_readings after attaching the program.
>> + */
>> + bperf_sync_counters(evsel);
>> + entry_cnt = bpf_map__max_entries(skel->maps.accum_readings);
>> + memset(values, 0, sizeof(struct bpf_perf_event_value) * num_cpu_bpf);
>> +
>> + for (i = 0; i < entry_cnt; i++) {
>> + bpf_map_update_elem(bpf_map__fd(skel->maps.accum_readings),
>> + &i, values, BPF_ANY);
>> + }
>> + return 0;
>> +}

[...]