Re: [PATCH v3 10/10] perf/cgroup: Do not switch system-wide events in cgroup switch

From: Liang, Kan
Date: Thu Nov 14 2019 - 10:16:59 EST




On 11/14/2019 8:57 AM, Peter Zijlstra wrote:
On Thu, Nov 14, 2019 at 08:46:51AM -0500, Liang, Kan wrote:


On 11/14/2019 5:43 AM, Peter Zijlstra wrote:
On Wed, Nov 13, 2019 at 04:30:42PM -0800, Ian Rogers wrote:
From: Kan Liang <kan.liang@xxxxxxxxxxxxxxx>

When counting system-wide events and cgroup events simultaneously, the
system-wide events are always scheduled out then back in during cgroup
switches, bringing extra overhead and possibly missing events. Switching
out system-wide flexible events may be necessary if the scheduled-in
task's cgroups have pinned events that need to be scheduled in at a higher
priority than the system-wide flexible events.

I'm thinking this patch is actively broken. groups->index is 'group' wide
and therefore across cpu/cgroup boundaries.

There is no !cgroup to cgroup hierarchy as this patch seems to assume,
specifically look at how the merge sort in visit_groups_merge() allows
cgroup events to be picked before !cgroup events.
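
The ordering point above can be illustrated with a minimal userspace sketch.
This is not the kernel's visit_groups_merge(); it only assumes that every
group carries an index drawn from one shared counter, and that the cgroup and
!cgroup groups sit in two lists that are each sorted by that index. The names
toy_group and toy_merge are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an event group; not the kernel structure. */
struct toy_group {
	const char *name;
	unsigned long group_index;	/* assumed: one counter shared by all groups */
	bool is_cgroup;
};

/*
 * Merge two lists that are each sorted by group_index, always taking the
 * entry with the smaller index. Returns the number of entries written to
 * out, which must have room for na + nb entries.
 */
static size_t toy_merge(const struct toy_group *a, size_t na,
			const struct toy_group *b, size_t nb,
			struct toy_group *out)
{
	size_t i = 0, j = 0, n = 0;

	assert(out != NULL);
	while (i < na || j < nb) {
		/* Take from a when b is exhausted or a's head has the lower index. */
		if (j >= nb || (i < na && a[i].group_index <= b[j].group_index))
			out[n++] = a[i++];
		else
			out[n++] = b[j++];
	}
	return n;
}
```

In this toy model, a cgroup group with index 1 is emitted before a !cgroup
group with index 2, because only the index decides the order.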


No, the patch intends to avoid switching !cgroup events during a cgroup context switch.

Which is wrong.

Why do we want to switch !cgroup system-wide events on a context switch?

How should current perf handle this case?
For example,
User A: perf stat -e cycles -G cgroup1
User B: perf stat -e instructions -a

There is only one cpuctx for each CPU, so both cycles and instructions are tracked in the flexible_active list.
When user A leaves, the cgroup context switch schedules out everything, including both cycles and instructions.
It seems that we will never switch the instructions event back in for user B.
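
To make the scenario concrete, here is a minimal userspace sketch of the
schedule-out step only. It is a toy model, not the kernel code: toy_event,
toy_cpuctx, toy_sched_in and toy_sched_out_all are hypothetical names, and
the single array stands in for the per-CPU flexible_active list. What happens
on the schedule-in side is the open question and is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_EVENTS 8

/* Hypothetical event; cgroup is NULL for a system-wide (!cgroup) event. */
struct toy_event {
	const char *name;
	const char *cgroup;
	bool active;
};

/* Hypothetical per-CPU context with one shared list of active events. */
struct toy_cpuctx {
	struct toy_event *flexible_active[TOY_MAX_EVENTS];
	size_t nr_active;
};

/* Mark an event active and append it to the shared list. */
static void toy_sched_in(struct toy_cpuctx *ctx, struct toy_event *ev)
{
	assert(ctx->nr_active < TOY_MAX_EVENTS);
	ev->active = true;
	ctx->flexible_active[ctx->nr_active++] = ev;
}

/*
 * Schedule out every event on the list with no cgroup filter, as in the
 * case described above. Returns the number of events scheduled out.
 */
static size_t toy_sched_out_all(struct toy_cpuctx *ctx)
{
	size_t n = ctx->nr_active;

	for (size_t i = 0; i < n; i++)
		ctx->flexible_active[i]->active = false;
	ctx->nr_active = 0;
	return n;
}
```

With cycles (cgroup1) and instructions (system-wide) both scheduled in,
toy_sched_out_all() stops both, so the system-wide event stays inactive
until something schedules it back in.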


Thanks,
Kan