Re: CFQ read performance regression

From: Miklos Szeredi
Date: Fri Apr 23 2010 - 06:57:10 EST


On Thu, 2010-04-22 at 16:31 -0400, Vivek Goyal wrote:
> On Thu, Apr 22, 2010 at 09:59:14AM +0200, Corrado Zoccolo wrote:
> > Hi Miklos,
> > On Wed, Apr 21, 2010 at 6:05 PM, Miklos Szeredi <mszeredi@xxxxxxx> wrote:
> > > Jens, Corrado,
> > >
> > > Here's a graph showing the number of issued but not yet completed
> > > requests versus time for CFQ and NOOP schedulers running the tiobench
> > > benchmark with 8 threads:
> > >
> > > http://www.kernel.org/pub/linux/kernel/people/mszeredi/blktrace/queue-depth.jpg
> > >
> > > It shows pretty clearly the performance problem is because CFQ is not
> > > issuing enough requests to fill the bandwidth.
> > >
> > > Is this the correct behavior of CFQ or is this a bug?
> > This is the expected behavior from CFQ, even if it is not optimal,
> > since we aren't able to identify multi-spindle disks yet.
>
> In the past we were of the opinion that for sequential workloads
> multi-spindle disks will not matter much, as readahead logic (in the OS and
> possibly in hardware also) will help. For random workloads we don't idle on
> a single cfqq anyway, so that case is fine. But my tests now seem to be
> telling a different story.
>
> I also have one FC link to an HP EVA and I am running an increasing
> number of sequential readers to see if throughput goes up as the number of
> readers goes up. The results are with noop and cfq. I do flush OS caches
> between runs but I have no control over caching on the HP EVA.
>
> Kernel=2.6.34-rc5
> DIR=/mnt/iostestmnt/fio DEV=/dev/mapper/mpathe
> Workload=bsr iosched=cfq Filesz=2G bs=4K
> =========================================================================
> job   Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
> ---   --- --  ------------   -----------    -------------  -----------
> bsr   1   1   135366         59024          0              0
> bsr   1   2   124256         126808         0              0
> bsr   1   4   132921         341436         0              0
> bsr   1   8   129807         392904         0              0
> bsr   1   16  129988         773991         0              0
>
> Kernel=2.6.34-rc5
> DIR=/mnt/iostestmnt/fio DEV=/dev/mapper/mpathe
> Workload=bsr iosched=noop Filesz=2G bs=4K
> =========================================================================
> job   Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
> ---   --- --  ------------   -----------    -------------  -----------
> bsr   1   1   126187         95272          0              0
> bsr   1   2   185154         72908          0              0
> bsr   1   4   224622         88037          0              0
> bsr   1   8   285416         115592         0              0
> bsr   1   16  348564         156846         0              0
>

These numbers are very similar to what I got.
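
To make the scaling in the two quoted tables easier to compare, here is a
rough Python sketch that normalizes each scheduler's read bandwidth to its
single-reader run. The bandwidth values are copied from the tables; the
function and variable names are made up for illustration only.

```python
# Read bandwidth in KB/s keyed by number of sequential readers (NR),
# copied from the cfq and noop tables quoted above.
cfq_bw = {1: 135366, 2: 124256, 4: 132921, 8: 129807, 16: 129988}
noop_bw = {1: 126187, 2: 185154, 4: 224622, 8: 285416, 16: 348564}


def scaling(bw_by_readers):
    """Return bandwidth at each reader count relative to the 1-reader run."""
    base = bw_by_readers[1]
    return {nr: bw_by_readers[nr] / base for nr in sorted(bw_by_readers)}


def noop_over_cfq(nr):
    """Return the ratio of noop to cfq read bandwidth for a reader count."""
    return noop_bw[nr] / cfq_bw[nr]


if __name__ == "__main__":
    cfq_scale = scaling(cfq_bw)
    noop_scale = scaling(noop_bw)
    for nr in sorted(cfq_bw):
        print(f"NR={nr:2d}  cfq x{cfq_scale[nr]:.2f}  "
              f"noop x{noop_scale[nr]:.2f}  noop/cfq {noop_over_cfq(nr):.2f}")
```

With 16 readers, cfq stays at about 0.96x of its single-reader bandwidth
while noop reaches about 2.76x.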

> So in the case of NOOP, throughput shot up to 348MB/s but CFQ remains more
> or less constant, at about 130MB/s.
>
> So at least in this case, a single sequential CFQ queue is not keeping the
> disk busy enough.
>
> I am wondering why my testing results were different in the past. Maybe
> it was a different piece of hardware and behavior varies across hardware?

Probably. I haven't seen this type of behavior on other hardware.

> Anyway, if that's the case, then we probably need to allow IO from
> multiple sequential readers and keep a watch on throughput. If throughput
> drops then reduce the number of parallel sequential readers. Not sure how
> much code that is, but with multiple cfqqs going in parallel, ioprio
> logic will more or less stop working in CFQ (on multi-spindle hardware).
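
The feedback idea in the quoted paragraph could be sketched roughly as below.
This is only a toy Python model under stated assumptions, not CFQ code: the
doubling/halving step, the 5% tolerance, the function names and the
"parallelism hurts" device are all hypothetical. The noop table quoted
earlier stands in for how a multi-spindle device responds.

```python
def next_reader_limit(limit, prev_bw, curr_bw, max_limit, tolerance=0.05):
    """Pick how many sequential readers may dispatch in parallel next.

    Halve the limit if throughput dropped by more than `tolerance`
    since the last sample, otherwise double it (capped at max_limit).
    """
    if curr_bw < prev_bw * (1.0 - tolerance):
        return max(1, limit // 2)
    return min(max_limit, limit * 2)


def run(bw_by_limit, steps, max_limit=16):
    """Simulate the controller against a table of limit -> bandwidth."""
    limit, prev_bw = 1, 0
    history = []
    for _ in range(steps):
        history.append(limit)
        curr_bw = bw_by_limit[limit]  # "measured" throughput at this limit
        limit = next_reader_limit(limit, prev_bw, curr_bw, max_limit)
        prev_bw = curr_bw
    return history


# Read bandwidth (KB/s) from the noop table, used as a stand-in device model.
noop_bw = {1: 126187, 2: 185154, 4: 224622, 8: 285416, 16: 348564}

# Hypothetical device on which parallel sequential readers reduce throughput.
seeky_bw = {1: 100, 2: 60, 4: 50, 8: 40, 16: 30}

if __name__ == "__main__":
    print(run(noop_bw, 6))
    print(run(seeky_bw, 4))
```

On the noop numbers the limit climbs to 16 and stays there. On the
hypothetical second device it bounces between 1 and 2, so a real
implementation would need some hysteresis, and this sketch ignores the
ioprio problem mentioned above entirely.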

Have you tested on older kernels? Around 2.6.16 it seemed to allow more
parallel reads, but that might have been just accidental (due to I/O
being submitted in a different pattern).

Thanks,
Miklos
