Block IO controller hierarchy support (Was: Re: [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them)

From: Vivek Goyal
Date: Thu Sep 13 2012 - 10:55:02 EST

On Wed, Sep 12, 2012 at 10:09:33AM -0700, Tejun Heo wrote:

> Yeah, it's mostly that cfq was already a hairy monster before blkcg
> was added to it and unfortunately we didn't make it any cleaner in the
> process and blkcg itself has a lot of other issues including being
> completely broken w.r.t. writeback writes. In addition there are two
> sub-controllers - the cfq one and blk-throttle. So, it's just that
> there are too many scary things to do and not enough man power or
> maybe interest. I hope we could just declare cgroup isn't supported
> on block devices but that doesn't seem feasible at this point either.
> I might / probably work on it and am hoping to coerce Vivek into it
> too. If you wanna jump in, please be my guest.

The biggest problem with the blkcg CFQ implementation is idling on cgroups.
If we don't idle on a cgroup, we don't get service differentiation for
most workloads, and if we do idle, performance starts to suck very soon
(the moment a few cgroups are created). Hierarchy will just exacerbate
this problem, because then one will try to idle at each group in the
hierarchy.

This problem is similar to CFQ's idling on sequential queues and
ioprio. Because we never idled on random IO queues, ioprios never
worked on random IO queues. The same is true for buffered write queues.
Similarly, if you don't idle on groups, service differentiation is not
visible for most workloads. Only workloads that are highly sequential
in nature see service differentiation.
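
To make the trade-off concrete, here is a rough toy model in Python. It is
not CFQ code and it is not how CFQ is implemented; everything in it (the
simulate() function, the service and think times, the vtime accounting) is
an assumption made up for illustration. Each group has one thread doing
dependent IO: it issues a request, waits for completion, thinks for a bit,
and issues the next one. The scheduler either idles (waits for the group
with the least weighted service) or does not idle (serves whoever has a
request pending right now). It only illustrates the differentiation versus
throughput trade-off, not how the cost grows with the number of groups.

```python
def simulate(weights, idle, service=1.0, think=0.5, total_time=1000.0):
    """Toy weighted scheduler over groups that each issue dependent IO.

    weights    -- hypothetical per-group weights (higher means more service)
    idle       -- if True, wait for the most deserving group's next request
    service    -- assumed time to complete one request
    think      -- assumed gap between a completion and the next request
    Returns (requests served per group, fraction of time the device was busy).
    """
    n = len(weights)
    ready_at = [0.0] * n   # when each group's next request arrives
    vtime = [0.0] * n      # service received so far, scaled by 1/weight
    served = [0] * n
    now = 0.0
    busy = 0.0

    while now < total_time:
        if idle:
            # Pick the group that is furthest behind, and idle the device
            # until that group's request shows up.
            g = min(range(n), key=lambda i: vtime[i])
            now = max(now, ready_at[g])
        else:
            # Never wait: only groups with a request pending now compete.
            ready = [i for i in range(n) if ready_at[i] <= now]
            if not ready:
                now = min(ready_at)   # nothing pending at all, skip ahead
                continue
            g = min(ready, key=lambda i: vtime[i])

        now += service
        busy += service
        served[g] += 1
        vtime[g] += service / weights[g]
        ready_at[g] = now + think

    return served, busy / now


if __name__ == "__main__":
    for idle in (False, True):
        served, util = simulate([1, 4], idle=idle)
        print("idle=%s served=%s utilization=%.2f" % (idle, served, util))
```

In this model, with weights 1 and 4, the non-idling run splits requests
about evenly and keeps the device busy all the time, so the weights have
no visible effect. The idling run gets close to a 1:4 split, but the
device sits idle for a noticeable part of the run.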

That's one fundamental problem for which we need a good answer before
we try to do more work on blkcg. We can write as much code as we like,
but at the end of the day it might still not be useful because of the
issue described above.

And that's the reason I think blkcg is primarily useful when you create
only a small number of cgroups, move the offending/problem-creating
workloads into them, and keep everything else running in the root cgroup.
That way you get less idling because there are fewer cgroups, and at the
same time you have provided more isolation from the offending workloads.
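
As a rough sketch of that setup, something like the following could be
used. It assumes the cgroup v1 blkio controller with its per-group
"blkio.weight" and "tasks" files. The isolate_workload() helper, the
group name, the weight value and the mount point are all hypothetical
and will differ per system; "blkio.weight" is only present when CFQ
group scheduling is enabled.

```python
import os


def isolate_workload(blkio_root, name, weight, pids):
    """Create one blkio cgroup and move the given PIDs into it.

    blkio_root -- where the v1 blkio hierarchy is mounted (an assumption,
                  for example /sys/fs/cgroup/blkio on some systems)
    name       -- hypothetical name for the group holding the offenders
    weight     -- value written to blkio.weight for that group
    pids       -- PIDs of the offending tasks; everything else stays in
                  the root cgroup
    """
    group = os.path.join(blkio_root, name)
    os.makedirs(group, exist_ok=True)

    with open(os.path.join(group, "blkio.weight"), "w") as f:
        f.write(str(weight))

    for pid in pids:
        # Write one PID per open/write, which is what the v1 "tasks"
        # file expects.
        with open(os.path.join(group, "tasks"), "w") as f:
            f.write(str(pid))

    return group
```

For example, isolate_workload("/sys/fs/cgroup/blkio", "offenders", 100,
[pid]) run as root would put a single problem task into its own low-weight
group and leave the rest of the system in the root cgroup.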

So if anybody has ideas on how to address above issue, I am all ears.
