Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1

From: Shaohua Li
Date: Tue Jan 19 2010 - 20:29:53 EST

Next message: Corey Ashford: "Re: [RFC] perf_events: support for uncore a.k.a. nest units"
Previous message: Tejun Heo: "Re: [PATCH 38/40] cifs: use workqueue instead of slow-work"
In reply to: Vivek Goyal: "Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1"
Next in thread: Jeff Moyer: "Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1"

On Tue, 2010-01-19 at 13:40 -0800, Vivek Goyal wrote:
> On Tue, Jan 19, 2010 at 09:10:33PM +0100, Corrado Zoccolo wrote:
> > On Mon, Jan 18, 2010 at 4:06 AM, Zhang, Yanmin
> > <yanmin_zhang@xxxxxxxxxxxxxxx> wrote:
> > > On Sat, 2010-01-16 at 17:27 +0100, Corrado Zoccolo wrote:
> > >> Hi Yanmin
> > >> On Mon, Jan 4, 2010 at 7:28 PM, Corrado Zoccolo <czoccolo@xxxxxxxxx> wrote:
> > >> > Hi Yanmin,
> > >> >> When low_latency=1, we get the biggest number with kernel 2.6.32.
> > >> >> Comparing with low_latency=0's result, the prior one is about 4% better.
> > >> > Ok, so 2.6.33 + corrado (with low_latency =0) is comparable with
> > >> > fastest 2.6.32, so we can consider the first part of the problem
> > >> > solved.
> > >> >
> > >> I think we can return now to your full script with queue merging.
> > >> I'm wondering if (in arm_slice_timer):
> > >> - if (cfqq->dispatched)
> > >> + if (cfqq->dispatched || (cfqq->new_cfqq && rq_in_driver(cfqd)))
> > >> return;
> > >> gives the same improvement you were experiencing just reverting to rq_in_driver.
> > > I did a quick testing against 2.6.33-rc1. With the new method, fio mmap randread 46k
> > > has about 20% improvement. With just checking rq_in_driver(cfqd), it has
> > > about 33% improvement.
> > >
> > Jeff, do you have an idea why in arm_slice_timer, checking
> > rq_in_driver instead of cfqq->dispatched gives so much improvement in
> > presence of queue merging, while it doesn't have noticeable effect
> > when there are no merges?
>
> Performance improvement because of replacing cfqq->dispatched with
> rq_in_driver() is really strange. This will mean we will do even lesser
> idling on the cfqq. That means faster cfqq switching and that should mean more
> seeks (for this test case) and reduce throughput. This is just opposite to
> your approach of treating a random read mmap queue as sync where we will
> idle on the queue.
I looked at this issue before, but I don't fully understand it. Some
interesting findings:
Replacing the cfqq->dispatched check makes cfq_select_queue switch queues
frequently, and it appears the frequent switching lets us get back to the
sequential requests in the workload quickly. Without the cfqq->dispatched
check, we dispatch a queue1 request, M requests from other queues, then a
queue1 request. With it, we dispatch a queue1 request, N requests from
other queues, then a queue1 request. From blktrace it appears that M < N,
which makes the workload less seeky. I don't see any other obvious
difference between the two cases in blktrace.
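
To make the M vs. N comparison concrete, here is a rough sketch of how the
gap could be counted from a dispatch trace. It is not kernel code and not a
blktrace tool: avg_gap() is a hypothetical helper, and it assumes the trace
has already been reduced to one queue id per dispatched request, in
dispatch order.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (an assumption, not part of CFQ or blktrace):
 * given the dispatch order as one queue id per request, return the
 * average number of requests from other queues that were dispatched
 * between two consecutive requests of queue `target`.
 * Returns -1.0 if `target` was dispatched fewer than twice.
 */
static double avg_gap(const int *qids, size_t n, int target)
{
    long last = -1;    /* index of the previous request from target */
    long pairs = 0;    /* consecutive target/target pairs seen so far */
    long between = 0;  /* foreign requests counted across those pairs */

    assert(qids != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (qids[i] != target)
            continue;
        if (last >= 0) {
            between += (long)i - last - 1;
            pairs++;
        }
        last = (long)i;
    }
    return pairs ? (double)between / (double)pairs : -1.0;
}
```

Running this for queue1 on a trace from each kernel would give M for one
case and N for the other; a smaller value means we return to queue1 sooner.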

Thanks,
Shaohua
