Re: [RFC PATCH 0/2] perf_events: add support for per-cpu per-cgroup monitoring

From: Stephane Eranian
Date: Thu Sep 02 2010 - 04:17:21 EST


On Thu, Sep 2, 2010 at 5:53 AM, Lin Ming <lin@xxxxxxx> wrote:
> On Tue, Aug 31, 2010 at 11:25 PM, Stephane Eranian <eranian@xxxxxxxxxx> wrote:
>> This series of patches adds per-container (cgroup) filtering capability
>> to per-cpu monitoring. In other words, we can monitor all threads belonging
>> to a specific cgroup and running on a specific CPU.
>>
>> This is useful to measure what is going on inside a cgroup. Something that
>> cannot easily and cheaply be achieved with either per-thread or per-cpu mode.
>> Cgroups can span multiple CPUs. CPUs can be shared between cgroups. Cgroups
>> can have lots of threads. Threads can come and go during a measurement.
>>
>> To measure per-cgroup today requires using per-thread mode and attaching to
>> all the current threads inside a cgroup and tracking new threads. That would
>> require scanning of /proc/PID, which is subject to race conditions, and
>> creating an event for each thread, each event requiring kernel memory.
>>
>> The approach taken by this patch is to leverage the per-cpu mode by simply
>> adding a filtering capability on context switch only when necessary. That
>> way the amount of kernel memory used remains bound by the number of CPUs.
>> We also do not have to scan /proc. We are only interested in cgroup level
>> counts, not per-thread.
>>
>> The cgroup to monitor is designated by passing a file descriptor opened
>> on a new per-cgroup file in the cgroup filesystem (perf_event.perf). The
>> option must be activated by setting perf_event_attr.cgroup=1 and passing
>> a valid file descriptor in perf_event_attr.cgroup_fd. Those are the only
>> two ABI extensions.
>>
>> The patch also includes changes to the perf tool to make use of cgroup
>> filtering. Both perf stat and perf record have been extended to support
>> cgroup via a new -G option. The cgroup is specified per event:
>>
>> $ perf stat -a -e cycles:u,cycles:u -G test1,test2 -- sleep 1
>>  Performance counter stats for 'sleep 1':
>>      2368881622  cycles          test1
>>               0  cycles          test2
>>     1.001938136  seconds time elapsed
>
> I have tried this new feature. Cool!
>
> perf stat [<options>] [<command>]
>
> Is the command ("sleep 1" in above example) also counted?
>
If it runs in the cgroup that is being measured, then yes. Not that
sleep will contribute much.

I am working on a second version of the patch that will correct
the issue with timing, and in particular time_enabled. In cgroup
mode, it needs to count the time the cgroup was active, not
wall-clock time. That will make the scaling more meaningful.
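
A minimal sketch of the scaling in question, assuming the usual
estimate count * time_enabled / time_running applied when an event
was not counting for the whole time it was enabled. The helper name
scale_count is hypothetical and is not part of the patch; it only
illustrates why the meaning of time_enabled matters.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): estimate the full count
 * from a raw count, given how long the event was enabled and how long
 * it was actually running on the PMU. Both times use the same unit.
 */
uint64_t scale_count(uint64_t raw, uint64_t time_enabled,
                     uint64_t time_running)
{
    /* The event never ran, so there is nothing to extrapolate from. */
    if (time_running == 0)
        return 0;

    /* Floating point avoids overflow in raw * time_enabled. */
    return (uint64_t)((double)raw * (double)time_enabled /
                      (double)time_running);
}
```

If time_enabled is wall-clock time while the cgroup was only active
for part of the interval, the ratio presumably overstates the
estimate, which is why time_enabled should track cgroup-active time.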

> Thanks,
> Lin Ming
>