Re: [patch]cfq-iosched: delete deep seeky queue idle logic

From: Corrado Zoccolo
Date: Tue Sep 27 2011 - 02:07:33 EST


On Mon, Sep 26, 2011 at 2:55 AM, Shaohua Li <shaohua.li@xxxxxxxxx> wrote:
> On Fri, 2011-09-23 at 13:50 +0800, Corrado Zoccolo wrote:
>> Il giorno 21/set/2011 13:16, "Shaohua Li" <shaohua.li@xxxxxxxxx> ha
>> scritto:
>> >
>> > On Sat, 2011-09-17 at 03:25 +0800, Corrado Zoccolo wrote:
>> > > On Fri, Sep 16, 2011 at 8:40 AM, Shaohua Li <shaohua.li@xxxxxxxxx>
>> wrote:
>> > > > On Fri, 2011-09-16 at 14:04 +0800, Corrado Zoccolo wrote:
>> > > >> On Fri, Sep 16, 2011 at 5:09 AM, Shaohua Li
>> <shaohua.li@xxxxxxxxx> wrote:
>> > > >> > Recently Maxim and I discussed why his aiostress workload
>> performs poorly. If
>> > > >> > you didn't follow the discussion, here are the issues we
>> found:
>> > > >> > 1. cfq seeky dection isn't good. Assume a task accesses
>> sector A, B, C, D, A+1,
>> > > >> > B+1, C+1, D+1, A+2...Accessing A, B, C, D is random. cfq will
>> detect the queue
>> > > >> > as seeky, but since when accessing A+1, A+1 is already in
>> disk cache, this
>> > > >> > should be detected as sequential really. Not sure if any real
>> workload has such
>> > > >> > access patern, and seems not easy to have a clean fix too.
>> Any idea for this?
>> > > >>
>> > > >> Not all disks will cache 4 independent streams, we can't make
>> that
>> > > >> assumption in cfq.
>> > > > sure thing. we can't make such assumption. I'm thinking if we
>> should
>> > > > move the seeky detection in request finish. If time between two
>> requests
>> > > > finish is short, we thought the queue is sequential. This will
>> make the
>> > > > detection adaptive. but seems time measurement isn't easy.
>> > > >
>> > > >> The current behaviour of assuming it as seeky should work well
>> enough,
>> > > >> in fact it will be put in the seeky tree, and it can enjoy the
>> seeky
>> > > >> tree quantum of time. If the second round takes a short time,
>> it will
>> > > >> be able to schedule a third round again after the idle time.
>> > > >> If there are other seeky processes competing for the tree, the
>> cache
>> > > >> can be cleared by the time it gets back to your 4 streams
>> process, so
>> > > >> it will behave exactly as a seeky process from cfq point of
>> view.
>> > > >> If the various accesses were submitted in parallel, the deep
>> seeky
>> > > >> queue logic should kick in and make sure the process gets a
>> sequential
>> > > >> quantum, rather than sharing it with other seeky processes, so
>> > > >> depending on your disk, it could perform better.
>> > > > yes, the idle logic makes it ok, but sounds like "make things
>> wrong
>> > > > first (in seeky detection) and then fix it later (the idle
>> logic)".
>> > > >
>> > > >> > 2. deep seeky queue idle. This makes raid performs poorly. I
>> would think we
>> > > >> > revert the logic. Deep queue is more popular with high end
>> hardware. In such
>> > > >> > hardware, we'd better not do idle.
>> > > >> > Note, currently we set a queue's slice after the first
>> request is finished.
>> > > >> > This means the drive already idles a little time. If the
>> queue is truely deep,
>> > > >> > new requests should already come in, so idle isn't required.
>> > > > What did you think about this? Assume seeky request takes long
>> time, so
>> > > > the queue is already idling for a little time.
>> > > I don't think I understand. If cfq doesn't idle, it will dispatch
>> an
>> > > other request from the same or an other queue (if present)
>> > > immediately, until all possible in-flight requests are sent. Now,
>> you
>> > > depend on NCQ for the order requests are handled, so you cannot
>> > > guarantee fairness any more.
>> > >
>> > > >
>> > > >> > Looks Vivek used to post a patch to rever it, but it gets
>> ignored.
>> > > >> >
>> http://us.generation-nt.com/patch-cfq-iosched-revert-logic-deep-queues-help-198339681.html
>> > > >> I get a 404 here. I think you are seeing only one half of the
>> medal.
>> > > >> That logic is there mainly to ensure fairness between deep
>> seeky
>> > > >> processes and normal seeky processes that want low latency.
>> > > > didn't understand it. The logic doesn't protect non-deep
>> process. how
>> > > > could it make the normal seeky process have low latency? or did
>> you have
>> > > > a test case for this, so I can analyze?
>> > > > I tried a workload with one task drives depth 4 and one task
>> drives
>> > > > depth 16. Appears the behavior isn't changed w/wo the logic.
>> > sorry for the delay.
>> >
>> > > Try a workload with one shallow seeky queue and one deep (16) one,
>> on
>> > > a single spindle NCQ disk.
>> > > I think the behaviour when I submitted my patch was that both were
>> > > getting 100ms slice (if this is not happening, probably some
>> > > subsequent patch broke it).
>> > > If you remove idling, they will get disk time roughly in
>> proportion
>> > > 16:1, i.e. pretty unfair.
>> > I thought you are talking about a workload with one thread depth 4,
>> and
>> > the other thread depth 16. I did some tests here. In an old kernel,
>> > without the deep seeky idle logic, the threads have disk time in
>> > proportion 1:5. With it, they get almost equal disk time. SO this
>> > reaches your goal. In a latest kernel, w/wo the logic, there is no
>> big
>> > difference (the 16 depth thread get about 5x more disk time). With
>> the
>> > logic, the depth 4 thread gets equal disk time in first several
>> slices.
>> > But after an idle expiration(mostly because current block plug hold
>> > requests in task list and didn't add them to elevator), the queue
>> never
>> > gets detected as deep, because the queue dispatch request one by
>> one. So
>> > the logic is already broken for some time (maybe since block plug is
>> > added).
>> Could be that dispatching requests one by one is harming the
>> performance, then?
> Not really. Say 4 requests are running, the task dispatches a request
> after one previous request is completed. requests are dispatching one by
> one but there are still 4 requests running at any time. Checking the
> in_flight requests are more precise for the deep detection.
>
What happens if there are 4 tasks, all that could dispatch 4 requests
in parallel? Will we reach and sustain 16 in flight requests, or it
will bounce around 4 in flight? I think here we could get a big
difference.
Probably it is better to move the deep queue detection logic in the
per-task queue?
Then cfq will decide if it should dispatch few requests from every
task (shallow case) or all requests from a single task (deep), and
then idle.

Thanks
Corrado
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/