Re: [PATCH v3 1/2] perf/core: Share an event with multiple cgroups

From: Peter Zijlstra
Date: Fri Apr 16 2021 - 07:59:54 EST


On Fri, Apr 16, 2021 at 08:22:38PM +0900, Namhyung Kim wrote:
> On Fri, Apr 16, 2021 at 7:28 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > On Fri, Apr 16, 2021 at 11:29:30AM +0200, Peter Zijlstra wrote:
> >
> > > > So I think we've had proposals for being able to close fds in the past;
> > > > while preserving groups etc. We've always pushed back on that because of
> > > > the resource limit issue. By having each counter be a filedesc we get a
> > > > natural limit on the amount of resources you can consume. And in that
> > > > respect, having to use 400k fds is things working as designed.
> > > >
> > > > Anyway, there might be a way around this..
> >
> > So how about we flip the whole thing sideways, instead of doing one
> > event for multiple cgroups, do an event for multiple-cpus.
> >
> > Basically, allow:
> >
> > perf_event_open(.pid=fd, cpu=-1, .flag=PID_CGROUP);
> >
> > Which would have the kernel create nr_cpus events [the corollary is that
> > we'd probably also allow: (.pid=-1, cpu=-1) ].
>
> Do you mean it'd have separate perf_events per cpu internally?
> From a cpu's perspective, there's nothing changed, right?
> Then it will have the same performance problem as of now.

Yes, but we'll not end up in ioctl() hell. The interface is sooo much
better. The performance thing just means we need to think harder.
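
For reference, a minimal sketch of what userspace has to do with the
existing interface: one perf_event_open() call, and so one fd, per CPU for
each cgroup, because cgroup mode is rejected with cpu == -1. The helper
name open_cgroup_cycles() is made up for illustration, and it assumes CPUs
0..nr_cpus-1 are all online.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thin wrapper: glibc does not export perf_event_open(). */
static int sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
			       int cpu, int group_fd, unsigned long flags)
{
	return (int)syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/*
 * Hypothetical helper: open one cycles counter per CPU for a cgroup.
 * cgroup_fd is an fd opened on the cgroup directory.
 * Returns nr_cpus on success; on failure closes what was opened so far
 * and returns -1.
 */
static int open_cgroup_cycles(int cgroup_fd, int nr_cpus, int *fds)
{
	struct perf_event_attr attr;
	int cpu;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		/* cgroup mode needs an explicit CPU; the cgroup fd goes in pid */
		fds[cpu] = sys_perf_event_open(&attr, cgroup_fd, cpu, -1,
					       PERF_FLAG_PID_CGROUP);
		if (fds[cpu] < 0) {
			while (cpu-- > 0)
				close(fds[cpu]);
			return -1;
		}
	}
	return nr_cpus;
}
```

With (.pid=fd, cpu=-1) allowed, that loop would collapse into a single
call and a single fd per cgroup.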

I thought cgroup scheduling got a lot better with the work Ian did a
while back? What's the actual bottleneck now?

> > Output could be done by adding FORMAT_PERCPU, which takes the current
> > read() format and writes a copy for each CPU event. (p)read(v)() could
> > be used to explode or partial read that.
>
> Yeah, I think it's good for read. But what about mmap?
> I don't think we can use file offset since it's taken for auxtrace.
> Maybe we can simply disallow that..

Are you actually using mmap() to read? I had a proposal for a
FORMAT_GROUP-like thing for mmap(), but I never implemented that (didn't
get the enthusiastic response I thought it would). But yeah, there's
nowhere near enough space in there for PERCPU.

Not sure how to do that. These counters must not be sampling counters,
because we can't share a buffer between multiple CPUs, so data/aux
just isn't a concern. But it's weird to have them magically behave
differently.
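
To make the FORMAT_PERCPU read() idea quoted above a bit more concrete,
here is a rough sketch of how userspace could compute where one CPU's
record sits. FORMAT_PERCPU does not exist; the layout (one copy of the
current non-group read() record per CPU, back to back, indexed by CPU
number) is only an assumption, and both helper names are made up. Only
the VALUE/TIME_ENABLED/TIME_RUNNING/ID fields are handled.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Size of one non-group read() record for a given attr.read_format. */
static size_t read_record_size(uint64_t read_format)
{
	size_t n = 1;	/* the counter value is always present */

	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
		n++;
	if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
		n++;
	if (read_format & PERF_FORMAT_ID)
		n++;
	return n * sizeof(uint64_t);
}

/* Assumed layout: per-CPU records stored back to back, CPU 0 first. */
static off_t percpu_record_offset(uint64_t read_format, int cpu)
{
	return (off_t)cpu * (off_t)read_record_size(read_format);
}
```

Under that assumed layout, a partial read of just CPU 3 would be
pread(fd, buf, read_record_size(fmt), percpu_record_offset(fmt, 3)),
and a plain read() with a large enough buffer would return all CPUs.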