Re: [RFC v1] add new io-scheduler to use cgroup on high-speed device

From: sanbai
Date: Sat Jun 08 2013 - 00:39:13 EST

On 2013-06-08 11:50, sanbai wrote:
On 2013-06-08 03:53, Vivek Goyal wrote:
On Fri, Jun 07, 2013 at 11:09:54AM +0800, sanbai wrote:
On 2013-06-05 21:30, Vivek Goyal wrote:
On Wed, Jun 05, 2013 at 10:09:31AM +0800, Robin Dong wrote:
We want to use blkio.cgroup on high-speed devices (like fusionio) for our mysql clusters.
After testing different io-schedulers, we found that cfq is too slow and deadline can't be used with cgroups.
So why not enhance deadline to be able to be used with cgroups instead of
coming up with a new scheduler?
I think that if we add cgroup support to deadline, it would no longer be
suitable to call it "deadline"; it is effectively a new ioscheduler, and a
new name would avoid confusing users.
Nobody got confused when we added cgroup support to CFQ. Not that
I am saying go add support to deadline. I am just saying that the need
for cgroup support does not sound like it justifies the need for a new
IO scheduler.

Can you give more details? Do you idle? Idling kills performance. If not,
then without idling how do you achieve performance differentiation?
We don't idle. When it comes to .elevator_dispatch_fn, we just compute a
quota for every group:

quota = nr_requests - rq_in_driver;
group_quota = quota * group_weight / total_weight;

and dispatch 'group_quota' requests for the corresponding group.
Therefore a high-weight group
will dispatch more requests than a low-weight group.
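
As a rough illustration only (this is not the tpps patch itself), the quota arithmetic above can be sketched in Python. The function name, the group weights and the request counts are hypothetical values chosen for the example; integer division stands in for the integer arithmetic a kernel implementation would use.

```python
def group_quotas(nr_requests, rq_in_driver, weights):
    """Split the free dispatch slots among groups in proportion to weight.

    nr_requests  -- queue depth limit (as in the formula above)
    rq_in_driver -- requests already handed to the driver
    weights      -- dict mapping group name -> weight (hypothetical values)
    """
    quota = max(nr_requests - rq_in_driver, 0)
    total_weight = sum(weights.values())
    if total_weight == 0:
        return {name: 0 for name in weights}
    # group_quota = quota * group_weight / total_weight, rounded down
    return {name: quota * w // total_weight for name, w in weights.items()}


# Hypothetical weights; the thread does not state the ones actually used.
weights = {"test1": 800, "test2": 600, "test3": 400, "test4": 200}
quotas = group_quotas(nr_requests=128, rq_in_driver=28, weights=weights)
print(quotas)  # {'test1': 40, 'test2': 30, 'test3': 20, 'test4': 10}
```

With 100 free slots and weights in a 4:3:2:1 ratio, the groups get 40, 30, 20 and 10 dispatch slots respectively.
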
Ok, this works only if all the groups are full all the time; otherwise
groups will lose their fair share. This simplifies things a lot.
That is, fairness is provided only if a group is always backlogged. In
practice, this happens only if a group is doing IO at very high rate
(like your fio scripts). Have you tried running any real life workload
in these cgroups (apache, databases etc) and see how good is service
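
To make the "always backlogged" point concrete, here is a small Python sketch under simplifying assumptions: a single dispatch round, no idling, and no redistribution of unused quota. The group names, quotas and backlog sizes are hypothetical.

```python
def dispatch_round(quotas, backlog):
    """Each group dispatches at most its quota, capped by what it has queued."""
    return {g: min(quotas[g], backlog.get(g, 0)) for g in quotas}


# Hypothetical 4:1 split of 50 dispatch slots between two groups.
quotas = {"A": 40, "B": 10}

# Both groups always have plenty queued: dispatch matches the 4:1 weights.
always_full = dispatch_round(quotas, {"A": 1000, "B": 1000})

# Group A has only 5 requests queued at dispatch time: it gets 5 of 15
# dispatched requests despite its higher weight.
light_a = dispatch_round(quotas, {"A": 5, "B": 1000})

print(always_full)  # {'A': 40, 'B': 10}
print(light_a)      # {'A': 5, 'B': 10}
```

Without idling, nothing waits for group A to refill its queue, so its share in that round drops from 80% to about 33%.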

Anyway, it sounds like this can be done at the generic block layer, like
blk-throtl, and it can sit on top so that it can work with all schedulers
and can also work with bio-based block drivers.
That's a new idea, I will give it a try later.

I ran the test again for cfq (slice_idle=0, quantum=128) and tpps:

cfq (slice_idle=0, quantum=128)
groupname  iops   avg-rt(ms)  max-rt(ms)
test1      16148  15          188
test2      12756  20          117
test3      9778   26          268
test4      6198   41          209

tpps
groupname  iops   avg-rt(ms)  max-rt(ms)
test1      17292  14          65
test2      15221  16          80
test3      12080  21          66
test4      7995   32          90
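
For a quick comparison of the two tables above, the following Python sketch sums the IOPS and computes each group's share of the total. The numbers are copied from the tables; treating the second table as the tpps run is an assumption based on the sentence introducing the test, and the helper name is made up for this example.

```python
# IOPS per group, copied from the two tables above.
cfq_iops = {"test1": 16148, "test2": 12756, "test3": 9778, "test4": 6198}
# Assumed to be the tpps run (the table itself carries no scheduler label).
second_iops = {"test1": 17292, "test2": 15221, "test3": 12080, "test4": 7995}


def iops_shares(table):
    """Return (total IOPS, per-group fraction of the total)."""
    total = sum(table.values())
    return total, {g: v / total for g, v in table.items()}


for label, table in (("cfq", cfq_iops), ("second table", second_iops)):
    total, shares = iops_shares(table)
    pretty = ", ".join(f"{g}={s:.1%}" for g, s in shares.items())
    print(f"{label}: total={total} iops; {pretty}")
```

The totals come to 44880 and 52588 IOPS respectively; the per-group shares can then be compared against whatever weights were configured for the cgroups.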

Looks like cfq with these settings is much better than before.
Yep, I am sure there are more simple opportunities for optimization
where it can help. Can you try a couple more things?

- Drive even deeper queue depth. Set quantum=512.

- set group_idle=0.
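
For reference, CFQ tunables such as quantum, group_idle and slice_sync are exposed as files under /sys/block/<dev>/queue/iosched/ while cfq is the active scheduler for that device. A minimal Python sketch for setting them follows; the helper name and the device name in the usage comment are hypothetical, and writing to sysfs requires root.

```python
import os


def set_cfq_tunables(iosched_dir, **tunables):
    """Write each tunable value to the file of the same name in iosched_dir.

    iosched_dir is expected to be something like
    /sys/block/<dev>/queue/iosched (the device name is up to the caller).
    """
    for name, value in tunables.items():
        path = os.path.join(iosched_dir, name)
        with open(path, "w") as f:
            f.write(f"{value}\n")


# Example usage (not executed here; "fioa" is a hypothetical device name):
#   set_cfq_tunables("/sys/block/fioa/queue/iosched",
#                    quantum=512, group_idle=0)
```

The same helper can be used for the other settings mentioned in this thread, for example slice_idle=0 or slice_sync=10.
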
I changed the iodepth to 512 in the fio script, and the new results are:

cfq (group_idle=0, quantum=512)
groupname  iops   avg-rt(ms)  max-rt(ms)
test1      15259  33          305
test2      11858  42          345
test3      8885   57          335
test4      5738   89          355

cfq (group_idle=0, quantum=512, slice_sync=10)
groupname  iops   avg-rt(ms)  max-rt(ms)
test1      16507  31          177
test2      12896  39          366
test3      9301   55          188
test4      6023   84          545

tpps
groupname  iops   avg-rt(ms)  max-rt(ms)
test1      16316  31          99
test2      15066  33          106
test3      12182  42          101
test4      8350   61          180

Looks like cfq works much better now.

But after I changed to 'randrw', the situation is a little different:

cfq (group_idle=0, quantum=512, slice_sync=10, slice_async=10)
groupname  iops(r/w)  avg-rt(ms)  max-rt(ms)
test1      8717/8726  26/31       553/576
test2      6944/6943  34/39       507/514
test3      4974/4961  49/53       725/658
test4      3117/3109  79/84       1107/1094

tpps
groupname  iops(r/w)  avg-rt(ms)  max-rt(ms)
test1      9130/9147  25/30       85/98
test2      7644/7652  30/36       103/118
test3      5727/5733  41/47       132/146
test4      3889/3891  62/68       193/214

Ideally this should effectively emulate what you are doing. That is, try
to provide fairness without idling on the group.

In practice I could not keep the group queue full: before the group
exhausted its slice, it became empty, got deleted from the service tree,
and lost its fair share. So if group_idle=0 leads to no service
differentiation, try slice_sync=10 and see what happens.



Robin Dong

