Re: [PATCH] perf evsel: Get group fd from CPU0 for system wide event

From: Jin, Yao
Date: Thu May 21 2020 - 00:38:16 EST


Hi Jiri,

On 5/20/2020 3:50 PM, Jiri Olsa wrote:
On Wed, May 20, 2020 at 01:36:40PM +0800, Jin, Yao wrote:
Hi Jiri,

On 5/18/2020 11:28 AM, Jin, Yao wrote:
Hi Jiri,

On 5/15/2020 4:33 PM, Jiri Olsa wrote:
On Fri, May 15, 2020 at 02:04:57PM +0800, Jin, Yao wrote:

SNIP

I think I've found the root cause. It looks like a serious bug in get_group_fd(): an out-of-bounds access.

For a group that mixes a system-wide event with per-core events, where the group leader is the system-wide event, the access violation will happen.

perf_evsel__alloc_fd() allocates only one FD member for a system-wide event (only FD(evsel, 0, 0) is valid).

But for a per-core event, perf_evsel__alloc_fd() allocates N FD members (N =
ncpus). For example, with ncpus = 8, FD(evsel, 0, 0) through FD(evsel, 7, 0)
are valid.

static int get_group_fd(struct evsel *evsel, int cpu, int thread)
{
        struct evsel *leader = evsel->leader;
        int fd;

        ...
        fd = FD(leader, cpu, thread);   /* access violation may happen here */
        ...
}

If the leader is a system-wide event, only FD(leader, 0, 0) is valid.

So when get_group_fd() accesses FD(leader, 1, 0), the access violation happens.
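
To illustrate the out-of-bounds access, here is a simplified model of the
xyarray-backed fd table (just a sketch, not the actual perf source):

#include <stdio.h>
#include <stdlib.h>

/* simplified stand-in for the xyarray behind evsel->core.fd */
struct fd_array {
        int ncpus;              /* x dimension */
        int nthreads;           /* y dimension */
        int fds[];              /* ncpus * nthreads entries */
};

static struct fd_array *alloc_fd(int ncpus, int nthreads)
{
        struct fd_array *a;

        a = calloc(1, sizeof(*a) + ncpus * nthreads * sizeof(int));
        if (a) {
                a->ncpus = ncpus;
                a->nthreads = nthreads;
        }
        return a;
}

/* models FD(evsel, cpu, thread): note there is no bounds check on cpu */
static int fd_at(struct fd_array *a, int cpu, int thread)
{
        return a->fds[cpu * a->nthreads + thread];
}

int main(void)
{
        /* system-wide leader: only one cpu slot is allocated */
        struct fd_array *leader = alloc_fd(1, 1);

        printf("%d\n", fd_at(leader, 0, 0));    /* FD(leader, 0, 0): fine */
        printf("%d\n", fd_at(leader, 1, 0));    /* FD(leader, 1, 0): reads past the array */
        free(leader);
        return 0;
}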

My fix is:

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 28683b0eb738..db05b8a1e1a8 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1440,6 +1440,9 @@ static int get_group_fd(struct evsel *evsel, int cpu, int thread)
         if (evsel__is_group_leader(evsel))
                 return -1;
 
+        if (leader->core.system_wide && !evsel->core.system_wide)
+                return -2;
so this effectively stops grouping system_wide events with others,
and I think it's correct, but how about events that differ in cpumask?


My understanding of the events that differ in cpumask is: if the
leader's cpumask does not fully match the evsel's cpumask, then we
stop the grouping. Is this understanding correct?

I have done some tests and reached some conclusions:

1. If the group mixes core and uncore events, the system_wide check can distinguish them.

2. If the group mixes core and uncore events and "-a" is
specified, system_wide is still false for the core event, so the
system_wide check can distinguish them too.

3. In my tests, the issue only occurs when we collect a metric that
mixes uncore and core events, so maybe checking
system_wide is OK.

should we perhaps ensure this before we call open? go through all
groups and check they are on the same cpus?


The issue doesn't happen most of the time (only for metrics
consisting of both uncore and core events), so falling back to non-grouped
events when the open call fails looks reasonable.

Thanks
Jin Yao

thanks,
jirka


+
         /*
          * Leader must be already processed/open,
          * if not it's a bug.
@@ -1665,6 +1668,11 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
                                         pid = perf_thread_map__pid(threads, thread);
 
                                 group_fd = get_group_fd(evsel, cpu, thread);
+                                if (group_fd == -2) {
+                                        errno = EINVAL;
+                                        err = -EINVAL;
+                                        goto out_close;
+                                }
 retry_open:
                                 test_attr__ready();

It enables perf_evlist__reset_weak_group(), and in the second_pass (in
__run_perf_stat) the events are opened successfully.
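
For reference, here is a simplified model of how that fallback behaves (a
toy sketch, not the perf source; try_open() and reset_weak_group() just
stand in for sys_perf_event_open() and perf_evlist__reset_weak_group()):

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* toy event: grouped events point at a leader, ungrouped at themselves */
struct event {
        const char *name;
        struct event *leader;
        bool weak_group;
};

/* stand-in for the open syscall: pretend grouped opens fail with EINVAL */
static int try_open(struct event *ev)
{
        if (ev->leader != ev) {
                errno = EINVAL;
                return -1;
        }
        return 0;
}

/* stand-in for perf_evlist__reset_weak_group(): break the group up */
static void reset_weak_group(struct event *evs, int n)
{
        for (int i = 0; i < n; i++)
                evs[i].leader = &evs[i];
}

int main(void)
{
        struct event evs[2] = {
                { "uncore_leader", &evs[0], true },
                { "core_member",   &evs[0], true },
        };
        int n = 2;

        for (int i = 0; i < n; i++) {
retry:
                if (try_open(&evs[i]) < 0) {
                        if (errno == EINVAL && evs[i].weak_group) {
                                /* second pass: reopen everything ungrouped */
                                reset_weak_group(evs, n);
                                goto retry;
                        }
                        return 1;
                }
                printf("%s opened\n", evs[i].name);
        }
        return 0;
}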

I have tested this fix on cascadelakex and it works.

Thanks
Jin Yao



Is this fix OK?

Another thing: do you think we need to rename
"evsel->core.system_wide" to "evsel->core.has_cpumask"?

The name "system_wide" may be misleading.

evsel->core.system_wide = pmu ? pmu->is_uncore : false;

"pmu->is_uncore" is true if PMU has a "cpumask". But it's not just uncore
PMU which has cpumask. Some other PMUs, e.g. cstate_pkg, also have cpumask.
So for this case, "has_cpumask" should be better.
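
For illustration, the property at stake is just "does this PMU expose a
cpumask in sysfs"; a minimal check could look like this (my own sketch,
not perf code):

#include <stdbool.h>
#include <stdio.h>

/* true if the PMU advertises a cpumask file in sysfs
 * (uncore_*, cstate_pkg, etc. do; the core PMU does not) */
static bool pmu_has_cpumask(const char *pmu_name)
{
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/bus/event_source/devices/%s/cpumask", pmu_name);
        f = fopen(path, "r");
        if (!f)
                return false;
        fclose(f);
        return true;
}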

those flags are checked in many places in the code, so I don't
think it's wise to mess with them

what I meant before was that the cpumask could be different for
different events, so even when both events are 'system_wide' the
leader 'fd' might not exist for the grouped events, and vice versa

so maybe we should ensure that we are grouping events with the same
cpu maps before we go for open, so that get_group_fd stays simple


Thanks for the comments. I'm preparing the patch according to this idea.
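
Roughly, I think the pre-open check could look like this (just a sketch;
cpu_maps_match() and evlist__groups_have_same_cpus() are tentative names,
using the libperf cpu map accessors):

/* tentative helper: two cpu maps hold exactly the same cpus */
static bool cpu_maps_match(struct perf_cpu_map *a, struct perf_cpu_map *b)
{
        if (a == b)
                return true;
        if (perf_cpu_map__nr(a) != perf_cpu_map__nr(b))
                return false;
        for (int idx = 0; idx < perf_cpu_map__nr(a); idx++)
                if (perf_cpu_map__cpu(a, idx) != perf_cpu_map__cpu(b, idx))
                        return false;
        return true;
}

/* tentative helper: every group member shares its leader's cpu map */
static bool evlist__groups_have_same_cpus(struct evlist *evlist)
{
        struct evsel *evsel;

        evlist__for_each_entry(evlist, evsel) {
                struct evsel *leader = evsel->leader;

                if (leader == evsel)
                        continue;
                if (!cpu_maps_match(leader->core.cpus, evsel->core.cpus))
                        return false;
        }
        return true;
}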


But I'm not sure if the change is OK for other cases, e.g. PT, which also
uses "evsel->core.system_wide".

please CC Adrian Hunter <adrian.hunter@xxxxxxxxx> on next patches
if you are touching this


I will not touch "evsel->core.system_wide" in the new patch.

Thanks
Jin Yao

thanks,
jirka