Re: [patch]cfq-iosched: delete deep seeky queue idle logic

From: Corrado Zoccolo
Date: Wed Sep 28 2011 - 03:09:44 EST


On Tue, Sep 27, 2011 at 8:33 AM, Shaohua Li <shaohua.li@xxxxxxxxx> wrote:
> On Tue, 2011-09-27 at 14:07 +0800, Corrado Zoccolo wrote:
>> On Mon, Sep 26, 2011 at 2:55 AM, Shaohua Li <shaohua.li@xxxxxxxxx> wrote:
>> > On Fri, 2011-09-23 at 13:50 +0800, Corrado Zoccolo wrote:
>> >> On 21 Sep 2011 13:16, "Shaohua Li" <shaohua.li@xxxxxxxxx> wrote:
>> >> >
>> >> > On Sat, 2011-09-17 at 03:25 +0800, Corrado Zoccolo wrote:
>> >> > > On Fri, Sep 16, 2011 at 8:40 AM, Shaohua Li <shaohua.li@xxxxxxxxx> wrote:
>> >> > > > On Fri, 2011-09-16 at 14:04 +0800, Corrado Zoccolo wrote:
>> >> > > >> On Fri, Sep 16, 2011 at 5:09 AM, Shaohua Li <shaohua.li@xxxxxxxxx> wrote:
>> >> > > >> > Recently Maxim and I discussed why his aiostress
>> >> > > >> > workload performs poorly. If you didn't follow the
>> >> > > >> > discussion, here are the issues we found:
>> >> > > >> > 1. cfq seeky detection isn't good. Assume a task
>> >> > > >> > accesses sectors A, B, C, D, A+1, B+1, C+1, D+1, A+2...
>> >> > > >> > Accessing A, B, C, D is random, so cfq will detect the
>> >> > > >> > queue as seeky, but by the time A+1 is accessed it is
>> >> > > >> > already in the disk cache, so this should really be
>> >> > > >> > detected as sequential. Not sure if any real workload
>> >> > > >> > has such an access pattern, and it doesn't seem easy to
>> >> > > >> > fix cleanly either. Any ideas for this?
>> >> > > >>
>> >> > > >> Not all disks will cache 4 independent streams; we can't
>> >> > > >> make that assumption in cfq.
>> >> > > > Sure thing, we can't make such an assumption. I'm thinking
>> >> > > > we should move the seeky detection to request completion:
>> >> > > > if the time between two request completions is short, we
>> >> > > > treat the queue as sequential. This would make the
>> >> > > > detection adaptive, but it seems time measurement isn't
>> >> > > > easy.
>> >> > > >
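
As an aside, here is a minimal user-space sketch (plain C, with made-up
constants and names; not the actual cfq-iosched code, which keeps a
shift-register seek history with its own thresholds) of why a purely
distance-based heuristic flags the A, B, C, D, A+1, ... pattern above as
seeky even though the second and later rounds hit the drive's cache:

#include <stdio.h>

/* Illustrative value only; cfq uses its own thresholds (CFQQ_SEEK_THR). */
#define SEEK_THRESHOLD  64      /* sectors; larger distance = "a seek" */

int main(void)
{
        /* Sectors A, B, C, D, A+1, B+1, ... with the streams far apart. */
        unsigned long long access[] = {
                1000, 500000, 1000000, 1500000,     /* A,   B,   C,   D   */
                1001, 500001, 1000001, 1500001,     /* A+1, B+1, C+1, D+1 */
                1002, 500002, 1000002, 1500002,     /* A+2, B+2, C+2, D+2 */
        };
        unsigned long long last = 0;
        size_t i;

        for (i = 0; i < sizeof(access) / sizeof(access[0]); i++) {
                unsigned long long dist = access[i] > last ?
                        access[i] - last : last - access[i];

                printf("req %2zu: sector %8llu dist %8llu -> %s\n",
                       i, access[i], dist,
                       (dist > SEEK_THRESHOLD) ? "seeky" : "sequential");
                last = access[i];
        }
        /* Every inter-request distance exceeds the threshold, so a purely
         * distance-based heuristic marks the whole stream seeky, even
         * though rounds two and later would hit the disk cache. */
        return 0;
}

A completion-time-based heuristic, as suggested above, would instead
notice the short service times of the later rounds.
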
>> >> > > >> The current behaviour of treating it as seeky should work
>> >> > > >> well enough: it will be put in the seeky tree, and it can
>> >> > > >> enjoy the seeky tree's quantum of time. If the second round
>> >> > > >> takes a short time, it will be able to schedule a third
>> >> > > >> round again after the idle time.
>> >> > > >> If there are other seeky processes competing for the tree,
>> >> > > >> the cache can be cleared by the time it gets back to your
>> >> > > >> 4-stream process, so it will behave exactly like a seeky
>> >> > > >> process from cfq's point of view.
>> >> > > >> If the various accesses were submitted in parallel, the
>> >> > > >> deep seeky queue logic should kick in and make sure the
>> >> > > >> process gets a sequential quantum, rather than sharing it
>> >> > > >> with other seeky processes, so depending on your disk, it
>> >> > > >> could perform better.
>> >> > > > Yes, the idle logic makes it OK, but it sounds like "make
>> >> > > > things wrong first (in seeky detection) and then fix it
>> >> > > > later (with the idle logic)".
>> >> > > >
>> >> > > >> > 2. Deep seeky queue idle. This makes raid perform
>> >> > > >> > poorly. I would think we should revert the logic. Deep
>> >> > > >> > queues are more common with high-end hardware, and on
>> >> > > >> > such hardware we'd better not idle.
>> >> > > >> > Note, currently we set a queue's slice after the first
>> >> > > >> > request is finished. This means the drive already idles
>> >> > > >> > for a little time. If the queue is truly deep, new
>> >> > > >> > requests should already have come in, so idling isn't
>> >> > > >> > required.
>> >> > > > What do you think about this? Assume a seeky request takes
>> >> > > > a long time, so the queue is already idling for a little
>> >> > > > while.
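
A hypothetical sketch of the argument above (the names are illustrative,
not the actual cfq-iosched code): by the time the first request
completes and the slice is set up, a truly deep queue has already
refilled itself, so only a queue that has gone empty would gain anything
from arming an idle timer.

#include <stdbool.h>

/* Sketch of the reasoning only: idle is worthwhile solely when the
 * queue has nothing queued and nothing in flight, i.e. when waiting
 * might let a nearby request arrive.  A deep queue never hits that
 * state while its task keeps submitting. */
bool worth_arming_idle_timer(unsigned int nr_queued,
                             unsigned int nr_in_flight)
{
        if (nr_queued > 0 || nr_in_flight > 0)
                return false;   /* queue keeps the device busy on its own */

        return true;            /* empty queue: idle briefly for the next request */
}
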
>> >> > > I don't think I understand. If cfq doesn't idle, it will
>> >> > > dispatch another request from the same or another queue (if
>> >> > > present) immediately, until all possible in-flight requests
>> >> > > are sent. Now you depend on NCQ for the order in which
>> >> > > requests are handled, so you cannot guarantee fairness any
>> >> > > more.
>> >> > >
>> >> > > >
>> >> > > >> > Looks like Vivek posted a patch to revert it, but it got
>> >> > > >> > ignored:
>> >> > > >> > http://us.generation-nt.com/patch-cfq-iosched-revert-logic-deep-queues-help-198339681.html
>> >> > > >> I get a 404 here. I think you are seeing only one side of
>> >> > > >> the coin. That logic is there mainly to ensure fairness
>> >> > > >> between deep seeky processes and normal seeky processes
>> >> > > >> that want low latency.
>> >> > > > I didn't understand that. The logic doesn't protect the
>> >> > > > non-deep process, so how could it give the normal seeky
>> >> > > > process low latency? Or do you have a test case for this
>> >> > > > that I can analyze?
>> >> > > > I tried a workload with one task driving depth 4 and one
>> >> > > > task driving depth 16. The behaviour appears unchanged with
>> >> > > > or without the logic.
>> >> > sorry for the delay.
>> >> >
>> >> > > Try a workload with one shallow seeky queue and one deep (16)
>> >> > > one, on a single-spindle NCQ disk.
>> >> > > I think the behaviour when I submitted my patch was that both
>> >> > > were getting a 100ms slice (if this is not happening, some
>> >> > > subsequent patch probably broke it).
>> >> > > If you remove idling, they will get disk time roughly in the
>> >> > > proportion 16:1, i.e. pretty unfair.
>> >> > I thought you were talking about a workload with one thread at
>> >> > depth 4 and the other thread at depth 16. I did some tests
>> >> > here. In an old kernel, without the deep seeky idle logic, the
>> >> > threads get disk time in the proportion 1:5; with it, they get
>> >> > almost equal disk time, so this reaches your goal. In the
>> >> > latest kernel, with or without the logic, there is no big
>> >> > difference (the depth-16 thread gets about 5x more disk time).
>> >> > With the logic, the depth-4 thread gets equal disk time in the
>> >> > first several slices. But after an idle expiration (mostly
>> >> > because the current block plug holds requests in the task's
>> >> > plug list and doesn't add them to the elevator), the queue
>> >> > never gets detected as deep, because the queue dispatches
>> >> > requests one by one. So the logic has been broken for some time
>> >> > (maybe since the block plug was added).
>> >> Could be that dispatching requests one by one is harming the
>> >> performance, then?
>> > Not really. Say 4 requests are running; the task dispatches a new
>> > request after a previous one completes. Requests are dispatched
>> > one by one, but there are still 4 requests running at any time.
>> > Checking the in-flight requests is more precise for the deep
>> > detection.
>> >
>> What happens if there are 4 tasks, each of which could dispatch 4
>> requests in parallel? Will we reach and sustain 16 in-flight
>> requests, or will it bounce around 4 in flight? I think here we
>> could see a big difference.
> Ah, yes, we really should change
> if (cfqq->queued[0] + cfqq->queued[1] >= 4)
>         cfq_mark_cfqq_deep(cfqq);
> to
> if (cfqq->queued[0] + cfqq->queued[1] + cfqq->dispatched >= 4)
>         cfq_mark_cfqq_deep(cfqq);
> This is a bug, though it isn't related to the original raid issue.
>
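
For illustration, a small user-space sketch of the point being made (the
struct and function names are invented; only the two counters correspond
to the cfqq fields in the snippet above): when a task keeps four
requests in flight but they trickle into the elevator one at a time, a
check on the queued counters alone never fires, while a check that also
counts dispatched requests does.

#include <stdio.h>
#include <stdbool.h>

/* Illustrative stand-ins for the cfq_queue counters quoted above. */
struct fake_cfqq {
        int queued;       /* cfqq->queued[0] + cfqq->queued[1]           */
        int dispatched;   /* cfqq->dispatched: sent to device, not done  */
};

static bool deep_old(const struct fake_cfqq *q)   /* queued only         */
{
        return q->queued >= 4;
}

static bool deep_new(const struct fake_cfqq *q)   /* queued + dispatched */
{
        return q->queued + q->dispatched >= 4;
}

int main(void)
{
        struct fake_cfqq q = { 0, 0 };
        int step;

        /* The task keeps 4 requests in flight, but (as with the block
         * plug) each request reaches the elevator one at a time, so the
         * queue never holds more than one request at the moment the
         * "deep" check runs. */
        for (step = 0; step < 8; step++) {
                q.queued++;                    /* request added to queue  */
                printf("step %d: queued=%d dispatched=%d old=%d new=%d\n",
                       step, q.queued, q.dispatched,
                       deep_old(&q), deep_new(&q));
                q.queued--;                    /* immediately dispatched  */
                if (q.dispatched < 4)
                        q.dispatched++;        /* ramp up to 4 in flight  */
        }
        return 0;
}

With the queued-only check the queue is never marked deep; once
dispatched requests are counted as well, it is marked deep as soon as
the device ramps up to four in flight.
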
>> Probably it is better to move the deep queue detection logic into
>> the per-task queue?
> it's per-task queue currently.
>
>> Then cfq will decide whether it should dispatch a few requests from
>> every task (shallow case) or all requests from a single task (deep),
>> and then idle.
> I don't get your point. Should the deep detection consider all tasks?
No no, I was just saying that if no queue is marked deep, then they
will all end up on the seeky tree, so cfq will dispatch requests from
each before idling. This is the ideal case for raid, I think, so you
only need to identify the sweet spot at which you prefer to insulate a
seeky queue (mark it as deep) from the others, for the reasons that
Vivek also points out, i.e. that a single queue, even a seeky one,
could have its requests more concentrated, so the seek time could be
shorter.

Thanks,
Corrado

> Thanks,
> Shaohua
>
>