Re: Perf event operation with hotplug cpus and cgroups

From: William Cohen
Date: Mon Mar 23 2015 - 12:02:43 EST


On 03/20/2015 04:20 PM, Peter Zijlstra wrote:
> On Fri, Mar 20, 2015 at 03:41:54PM -0400, William Cohen wrote:
>>
>> There isn't any desire to aggregate the different cgroup data
>> together. The desired grouping is measurements per cgroup, kind of
>> like the pid scoping for perf but for a cgroup. It is just that the
>> way that the perf event measurements work for cgroups means that the
>> measurements need to be taken system-wide.
>
> Still doesn't make any sense; if you want to monitor just the vcpu
> attach to the one task already.
>
> Without the vcpu per cgroup thing you'll never end up with O(n^2). You
> get cgroups * cpus, which is what it is.
>
> Your specific complaint was about this weird setup where you place
> nr_cpus tasks in nr_cpus cgroups and then end up with O(n^2) fds.
>
> Also this isn't perf specific, cgroups _are_ system wide, so obviously
> it needs system-wide measurement.


Hi Peter,

Monitoring OpenShift gears is likely to run into this situation, where cgroups >= cpus. Each OpenShift gear is a collection of processes running in a cgroup that is not pinned to a particular processor. A gear is typically limited to a fraction of a processor's time, so there are multiple gears per processor.

http://docs.openshift.org/origin-m4/oo_administration_guide.html#managing-gear-capacity
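
To make the fd count concrete, here is a minimal user-space sketch of what a monitor has to do today for one cgroup: open one counting event per cpu with perf_event_open() and PERF_FLAG_PID_CGROUP. The sketch_* names are made up for illustration, the choice of a cycles counter is arbitrary, and the loop assumes cpus 0..nr_cpus-1 are all online, which is exactly what hotplug can break.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Number of fds a monitor ends up holding: one per (cgroup, cpu) pair. */
long sketch_fds_needed(long nr_cgroups, long nr_cpus)
{
    assert(nr_cgroups >= 0 && nr_cpus >= 0);
    return nr_cgroups * nr_cpus;
}

/*
 * Open one cycle-counting event per cpu for a single cgroup.
 * cgroup_fd is an open fd on the cgroup's directory in the perf_event
 * cgroup hierarchy. Fills fds[0..nr_cpus-1]. Returns 0 on success, or
 * -1 on the first failure after closing whatever was already opened.
 * Assumption: cpus are numbered 0..nr_cpus-1 and all are online.
 */
int sketch_open_cgroup_events(int cgroup_fd, int nr_cpus, int *fds)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;

    for (int cpu = 0; cpu < nr_cpus; cpu++) {
        /* With PERF_FLAG_PID_CGROUP the pid argument is the cgroup fd. */
        fds[cpu] = (int)syscall(SYS_perf_event_open, &attr, cgroup_fd,
                                cpu, -1,
                                (unsigned long)PERF_FLAG_PID_CGROUP);
        if (fds[cpu] < 0) {
            while (cpu-- > 0)
                close(fds[cpu]);
            return -1;
        }
    }
    return 0;
}
```

With one gear per cgroup, a machine with 80 cpus and only 80 gears already needs sketch_fds_needed(80, 80) = 6400 fds, and there are normally several gears per processor.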

>
>>> Just measure the parent cgroup of the vcpu cgroups if you're really only
>>> interested in the virtual machine crap thing.
>>>
>>>> Given the issues with these use cases, is user-space setting up the
>>>> counters for each cpu in the system the best solution? Would it be
>>>> better to allow the system-wide data collection to be selected with
>>>> one perf event open with pid==-1 and cpu==-1? Is setup of per cpu
>>>> monitoring and aggregation of the counters across processors too
>>>> difficult to do in the kernel?
>>>
>>> Not hard at all, but useless, you need a fd per cpu in order to get your
>>> data out. Remember that the ring buffers are strictly per cpu.
>>>
>>
>> Are the ring buffers needed just for the sampling or are they also
>> needed for "perf stat" type information?
>
> No, counting could do this; but even there I'd worry about scalability.
> We'd need to fold the value into the 'global' counter on every cgroup
> switch, now imagine all 80 cpus context switching at high rates between
> cgroups.
>
> Also we'd need to somehow manage multiple events with a single fd,
> that's complexity we really do not need.
>
> When we started out with perf we had such global constructs and we had
> to quickly kill them for much smaller systems than this 80 cpu machine
> you talk about.
>

No question that doing frequent updates of global data structures kills performance. What about accumulating the system-wide information on a per-cpu basis and making the read-out the slow operation? The read-out would gather the information from all the processors, so the context switches would not be slowed down.
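
Roughly what I have in mind, as a user-space sketch rather than kernel code: each cpu only ever updates its own slot, and the read-out pays the cost of walking all the slots. All of the sketch_* names and the 64-byte cache line size are assumptions for illustration, and a real implementation would also need some way to read another cpu's slot safely while that cpu is running.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NR_CPUS 80      /* hypothetical: the 80 cpu machine above */
#define SKETCH_CACHE_LINE 64   /* assumed cache line size in bytes */

/* One slot per cpu, padded so that cpus do not share a cache line. */
struct sketch_percpu_count {
    uint64_t value;
    char pad[SKETCH_CACHE_LINE - sizeof(uint64_t)];
};

struct sketch_event {
    struct sketch_percpu_count cpu[SKETCH_NR_CPUS];
};

/* Fast path: run on a cgroup switch on `cpu`; touches only that cpu's slot. */
void sketch_event_add(struct sketch_event *ev, int cpu, uint64_t delta)
{
    assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
    ev->cpu[cpu].value += delta;
}

/* Slow path: the read-out walks every cpu's slot and sums the values. */
uint64_t sketch_event_read(const struct sketch_event *ev)
{
    uint64_t sum = 0;

    for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
        sum += ev->cpu[cpu].value;
    return sum;
}
```

The point of the split is that nothing global is written on the context switch path; only the reader, which runs rarely, touches data belonging to other processors.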

What are the other complexities of managing multiple cpu performance events with a single fd? Allocating and freeing the underlying data structures on each of the processors? Starting and stopping the measurements?

-Will