Re: [PATCH 0/9] Per-cgroup /proc/stat

From: Glauber Costa
Date: Wed Sep 14 2011 - 16:21:23 EST


On 09/14/2011 05:13 PM, Peter Zijlstra wrote:
On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
[[ For those getting this twice: I sent it previously to containers
ml, but I guess it was out. Sending now to a broader audience anyway ]]

Hi,

This patchset is a simple initial proposal for a per-cgroup/container
display of /proc/stat. The display method is based on Daniel's idea of
exposing a file that can be bind mounted (Daniel, is that more or less
what you had in mind?)

To grab the stats themselves, I am (ab)using cpuacct cgroup. percpu counters
are dropped in favor of normal percpu pointers, so we can easily track
per-cpu quantities.

In case you guys like this idea, my TODO list would include the removal
of the show stat code in fs/proc/stat.c altogether, and the displaying
of some fields I haven't touched yet.

Also, to demonstrate one of the potential uses of such a method, I
implemented a feature commonly found in hypervisors - steal time - on top
of it. I argue that containers can/should also display steal time when
available. It turns out that, because we run on the same kernel,
steal time is quite easy to implement once we have per-container tick
accounting in place.

Please let me know what you guys think

Glauber Costa (9):
Remove parent field in cpuacct cgroup
Make cpuacct fields per cpu variables
Include nice values in cpuacct
Include irq and softirq fields in cpuacct
Include guest fields in cpuacct
Include idle and iowait fields in cpuacct
Create cpuacct.proc.stat file
per-cgroup boot time
Report steal time for cgroup

kernel/sched.c | 265 +++++++++++++++++++++++++++++++++++++++++++++++++-------
1 files changed, 234 insertions(+), 31 deletions(-)
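
For readers unfamiliar with the layout being reproduced per cgroup, here is a minimal sketch of reading the aggregate "cpu" line of /proc/stat, whose columns correspond to the fields the patches above add to cpuacct (nice, irq, softirq, guest, idle, iowait, steal). The struct and function names are hypothetical and only illustrate the column order; this is not code from the series, and the exact contents of the proposed cpuacct.proc.stat file are not shown in this message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Columns of the aggregate "cpu" line in /proc/stat, in file order. */
struct cpu_line {
    unsigned long long user, nice, system, idle, iowait;
    unsigned long long irq, softirq, steal, guest, guest_nice;
};

/*
 * Parse one aggregate "cpu ..." line (hypothetical helper).
 * Returns the number of numeric columns read (4..10), or -1 if the
 * line is not an aggregate cpu line. Columns absent from the line
 * are left at zero.
 */
static int parse_cpu_line(const char *line, struct cpu_line *out)
{
    struct cpu_line v;
    int n;

    assert(line != NULL && out != NULL);

    /* Reject per-CPU lines such as "cpu0 ..." and unrelated lines. */
    if (strncmp(line, "cpu ", 4) != 0)
        return -1;

    memset(&v, 0, sizeof(v));
    n = sscanf(line + 4,
               "%llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
               &v.user, &v.nice, &v.system, &v.idle, &v.iowait,
               &v.irq, &v.softirq, &v.steal, &v.guest, &v.guest_nice);
    if (n < 4)
        return -1;

    *out = v;
    return n;
}
```

A caller would read lines from the file with fgets() and pass each one to parse_cpu_line() until it returns a non-negative count.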

I hate it already.. it just smells of more senseless accounting
overhead.

Guys we should seriously trim back a lot of that code, not grow ever
more and more. The sad fact is that if you build a kernel with
cpu-cgroup support the context switch cost is more than double that of a
kernel without, and then you haven't even started creating cgroups yet.

Also, how doesn't all this duplicate part of cpuacct-cgroup?

/me won't actually look at the patches for a little while longer.

Hey Peter,

Answering just a single point here: if you look closely, it does not
duplicate anything from cpuacct. What it does is split the accounting
into finer-grained groups than just user/system, and the accounting code
is not called any more often than it already was. Also, I changed the
counters from percpu counters to plain per-cpu variables (so we can
access per-cpu data). If there is any perf. change wrt the current code,
it comes from that, and since per-cpu variables are cheaper to update
(and summing up is much less frequent), it will end up even cheaper.
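
To make the trade-off concrete, here is a rough user-space model of the idea: each update touches only the current CPU's slot, per-CPU values stay readable, and the cross-CPU sum is computed only when somebody asks for it. All names (group_stat, NR_CPUS_MODEL, the enum) are hypothetical and this is not the code from the patches; the kernel uses its own per-cpu infrastructure, and this model ignores concurrency entirely.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NR_CPUS_MODEL 4 /* assumption: small fixed CPU count for the model */

/* Finer-grained buckets than just user/system. */
enum stat_field {
    ST_USER, ST_NICE, ST_SYSTEM, ST_IRQ, ST_SOFTIRQ,
    ST_IDLE, ST_IOWAIT, ST_NR
};

/* Hypothetical per-group state: one row of counters per CPU. */
struct group_stat {
    uint64_t percpu[NR_CPUS_MODEL][ST_NR];
};

/* Hot path: update only this CPU's slot; no shared total is touched. */
static void group_stat_add(struct group_stat *gs, int cpu,
                           enum stat_field f, uint64_t delta)
{
    assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
    gs->percpu[cpu][f] += delta;
}

/* Per-CPU view, which a single aggregated total cannot provide. */
static uint64_t group_stat_cpu(const struct group_stat *gs, int cpu,
                               enum stat_field f)
{
    assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
    return gs->percpu[cpu][f];
}

/* Slow path: walk all CPUs only when the stat file is read. */
static uint64_t group_stat_sum(const struct group_stat *gs,
                               enum stat_field f)
{
    uint64_t total = 0;
    int cpu;

    for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
        total += gs->percpu[cpu][f];
    return total;
}
```

The design choice in this model is that the cost moves from the frequent update path to the infrequent read path, which is the argument made above.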

The steal time feature is really trivial once this accounting is in place.

About your point of the context switch cost, how would you feel if we optimized it out using static_branch() like it was done for kvm steal time?

I can also commit to taking a look at making the overall performance suck less here, but it is really orthogonal to what I just posted.
