Re: [PATCH] cfq-iosched: rework seeky detection

From: Corrado Zoccolo
Date: Tue Jan 12 2010 - 03:53:12 EST

On Tue, Jan 12, 2010 at 2:49 AM, Shaohua Li <> wrote:
> On Mon, Jan 11, 2010 at 10:46:23PM +0800, Corrado Zoccolo wrote:
>> Hi,
>> On Mon, Jan 11, 2010 at 2:47 AM, Shaohua Li <> wrote:
>> > On Sat, Jan 09, 2010 at 11:59:17PM +0800, Corrado Zoccolo wrote:
>> >> Current seeky detection is based on average seek length.
>> >> This is suboptimal, since the average will not distinguish between:
>> >> * a process doing medium sized seeks
>> >> * a process doing some sequential requests interleaved with larger seeks
>> >> and even a medium seek can take a lot of time, if the requested sector
>> >> happens to be behind the disk head in the rotation (50% probability).
>> >>
>> >> Therefore, we change the seeky queue detection to work as follows:
>> >> * each request can be classified as sequential if it is very close to
>> >>   the current head position, i.e. it is likely in the disk cache (disks
>> >>   usually read more data than requested, and put it in cache for
>> >>   subsequent reads). Otherwise, the request is classified as seeky.
>> >> * a history window of the last 32 requests is kept, storing the
>> >>   classification result.
>> >> * A queue is marked as seeky if more than 1/8 of the last 32 requests
>> >>   were seeky.
>> >>
>> >> This patch fixes a regression reported by Yanmin, on mmap 64k random
>> >> reads.
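
To make the scheme described above concrete, here is a minimal userspace
sketch. It is not the patch itself: the struct, the function names and the
threshold value (SEEK_THR_SECTORS) are assumptions made up for illustration.
It only shows the idea of a 32-entry history kept as a bitmask, with a queue
counted as seeky when more than 1/8 of the entries are seeky.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold: a request starting within this many sectors of
 * the end of the previous one is treated as sequential. Illustrative only. */
#define SEEK_THR_SECTORS 800u

struct seek_tracker {
	uint32_t history;  /* one bit per request, 1 = seeky, bit 0 = newest */
	uint64_t last_pos; /* sector just past the end of the previous request */
};

/* Classify one request and push the result into the 32-bit window.
 * The oldest entry falls off the top when the mask is shifted. */
static void seek_tracker_update(struct seek_tracker *t,
				uint64_t sector, uint32_t nr_sectors)
{
	uint64_t dist = sector > t->last_pos ? sector - t->last_pos
					     : t->last_pos - sector;

	t->history = (t->history << 1) |
		     (dist > SEEK_THR_SECTORS ? 1u : 0u);
	t->last_pos = sector + nr_sectors;
}

/* Portable bit count, so the sketch does not depend on compiler builtins. */
static unsigned int popcount32(uint32_t v)
{
	unsigned int n = 0;

	while (v) {
		v &= v - 1; /* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Seeky if more than 1/8 of the last 32 requests (i.e. more than 4) were. */
static bool seek_tracker_is_seeky(const struct seek_tracker *t)
{
	return popcount32(t->history) > 32 / 8;
}
```

With this model, four distant requests in the window are still tolerated,
and the fifth one flips the queue to seeky; 32 sequential requests in a row
clear the history again.
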
>> > Could we avoid counting a big request (say the request data is >= 32k) as seeky,
>> > regardless of the seek distance? That way a 64k random sync read would
>> > not be treated as seeky either.
>> I think I understand what you are proposing, but I don't think request
>> size should matter at all for rotational disks.
> randread with a 32k bs definitely has better throughput than with a 4k bs. So the request
> size does matter. From an IOPS point of view, 64k and 4k might not differ
> on the device, but from a performance point of view, they differ a lot.
Assume we have two queues, one with 64k requests and another with 4k requests,
and that our ideal disk will service them with the same IOPS 'v'.
Then, servicing the first for 100ms and then the second for 100ms, we
will have, averaging over the 200ms period of the schedule:
first queue IOPS = v * 100/200 = v/2
second queue IOPS = v * 100/200 = v/2
Now the bandwidth will simply be IOPS * request size.
If, instead, you service one request from one queue and one from the
other (and keep switching for 200ms), with v IOPS, each queue will again
obtain v/2 IOPS, i.e. exactly the same numbers.
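
The arithmetic above can be written down as a few helper functions. This is
only a sketch of the idealized model (same IOPS 'v' for both request sizes);
the function names are made up for illustration, and v = 100 in the usage
note is an arbitrary example value, not a measurement.

```c
/* IOPS seen by a queue that owns slice_ms out of every period_ms,
 * on a disk that completes disk_iops requests per second. */
static double queue_iops_timesliced(double disk_iops,
				    double slice_ms, double period_ms)
{
	return disk_iops * slice_ms / period_ms;
}

/* IOPS seen by a queue when the disk alternates one request at a time
 * among nr_queues backlogged queues. */
static double queue_iops_interleaved(double disk_iops, unsigned int nr_queues)
{
	return disk_iops / nr_queues;
}

/* Bandwidth is simply IOPS * request size (here in KiB per second). */
static double queue_bandwidth_kib(double iops, double request_kib)
{
	return iops * request_kib;
}
```

For example, with v = 100, both queue_iops_timesliced(100, 100, 200) and
queue_iops_interleaved(100, 2) give 50 IOPS per queue, so the 64k queue gets
50 * 64 = 3200 KiB/s and the 4k queue gets 50 * 4 = 200 KiB/s under either
schedule.
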

But, instead, if we have a 2-disk RAID 0, with stripe >= 64k, and the
64k accesses are aligned (do not cross the stripe), we will have 50%
probability that the requests from the 2 queues are serviced in
parallel, thus increasing the total IOPS and bandwidth. This cannot
happen if you service for 100ms a single depth-1 seeky queue.
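
The 50% figure can be checked with a small sketch. It assumes the simplest
possible RAID 0 layout (stripe unit k lives on disk k % ndisks) and that the
two requests hit uniformly random, independent, stripe-aligned locations;
both function names are hypothetical.

```c
#include <stdint.h>

/* Disk holding a given byte offset in a simple RAID 0 layout. */
static unsigned int raid0_disk(uint64_t offset, uint64_t stripe,
			       unsigned int ndisks)
{
	return (unsigned int)((offset / stripe) % ndisks);
}

/* Enumerate every pair of stripe-aligned requests over nr_stripes stripe
 * units and return the fraction of pairs that land on different disks,
 * i.e. that could be serviced in parallel. */
static double raid0_parallel_fraction(unsigned int nr_stripes,
				      uint64_t stripe, unsigned int ndisks)
{
	unsigned long parallel = 0, total = 0;
	unsigned int i, j;

	for (i = 0; i < nr_stripes; i++) {
		for (j = 0; j < nr_stripes; j++) {
			total++;
			if (raid0_disk(i * stripe, stripe, ndisks) !=
			    raid0_disk(j * stripe, stripe, ndisks))
				parallel++;
		}
	}
	return total ? (double)parallel / (double)total : 0.0;
}
```

With 2 disks, a 64k stripe and an even number of stripe units, half of all
pairs end up on different disks.
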

>> Usually, the disk firmware will load a big chunk of data in its cache even when
>> requested to read a single sector, and will provide the following ones
>> from the cache if you read them sequentially.
>> Now, in CFQ, what we really mean by saying that a queue is seeky is that
>> waiting a bit in order to serve another request from this queue doesn't
>> give any benefit w.r.t. switching to another queue.
> If we don't idle, we might switch to a random 4k access or any other kind of queue. Compared
> to switching to another queue with small blocks, continuing with the big requests
> does give a benefit.
CFQ in 2.6.33 works differently than it did before.
Now, seeky queues have an aggregate time slice, and within this time
slice, you will switch between seeky queues fairly. So it cannot happen
that a seeky queue loses its time slice.
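
As a toy model of that behaviour (not CFQ's actual code; the function name
and the idea of expressing the slice as a request budget are assumptions for
illustration), sharing one aggregate slice round-robin among the seeky
queues looks like this:

```c
#include <stddef.h>

/* Serve `budget` requests from one shared slice, taking one request from
 * each backlogged queue in turn. served[i] receives the number of requests
 * completed for queue i. */
static void share_slice_round_robin(unsigned int budget,
				    unsigned int *served, size_t nr_queues)
{
	size_t i;
	unsigned int n;

	for (i = 0; i < nr_queues; i++)
		served[i] = 0;
	if (nr_queues == 0)
		return;
	for (n = 0; n < budget; n++)
		served[n % nr_queues]++;
}
```

In this model no queue can be starved of the slice: the counts of any two
queues differ by at most one request.
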

>> So, if you read a single 64k block from disk and then seek, then you can service
>> any other request without losing bandwidth.
> But the 64k bs queue loses its slice, which might mean the device serves more 4k accesses.
> As a result, bandwidth is reduced.
If both queues are backlogged and at the same priority, they will be
serviced fairly.
If one queue has large think time (or lower priority), the other will
be serviced more often.
>> Instead, if you are reading 4k, then the next ones (and so on up to 64k, as it
>> happens with mmap when you fault in a single page at a time), then it
>> is convenient to wait for the next request, since it has a 3/4 chance of
>> being sequential, and so of being serviced from the cache.
>> I'm currently testing a patch to consider request size in SSDs, instead.
>> In SSDs, the location of the request doesn't mean anything, but the
>> size is meaningful.
>> Therefore, submitting together many small requests from different
>> queues can improve the overall performance.
> Agree.
> Thanks,
> Shaohua
