Re: io-scheduler tuning for better read/write ratio

From: Ralf Gross
Date: Mon Jun 29 2009 - 05:49:43 EST


Wu Fengguang wrote:
> On Fri, Jun 26, 2009 at 06:44:06PM +0800, Jens Axboe wrote:
> > On Fri, Jun 26 2009, Wu Fengguang wrote:
> > > On Tue, Jun 23, 2009 at 03:42:46AM +0800, Jeff Moyer wrote:
> > > > Ralf Gross <rg@xxxxxxxxxxxxxxxxxxxxxxx> writes:
> > > >
> > > > > Jeff Moyer wrote:
> > > > >> Jeff Moyer <jmoyer@xxxxxxxxxx> writes:
> > > > >>
> > > > >> > Ralf Gross <rg@xxxxxxxxxxxxxxxxxxxxxxx> writes:
> > > > >> >
> > > > > >> >> Casey Dahlin wrote:
> > > > >> >>> On 06/16/2009 02:40 PM, Ralf Gross wrote:
> > > > > >> >>> > David Newall wrote:
> > > > >> >>> >> Ralf Gross wrote:
> > > > >> >>> >>> write throughput is much higher than the read throughput (40 MB/s
> > > > >> >>> >>> read, 90 MB/s write).
> > > > >> >>> >
> > > > >> >>> > Hm, but I get higher read throughput (160-200 MB/s) if I don't write
> > > > >> >>> > to the device at the same time.
> > > > >> >>> >
> > > > >> >>> > Ralf
> > > > >> >>>
> > > > >> >>> How specifically are you testing? It could depend a lot on the
> > > > >> >>> particular access patterns you're using to test.
> > > > >> >>
> > > > >> >> I did the basic tests with tiobench. The real test is a test backup
> > > > >> >> (bacula) with 2 jobs that create 2 30 GB spool files on that device.
> > > > > >> >> The jobs partially write to the device in parallel. Depending on which
> > > > > >> >> spool file reaches 30 GB first, one job starts reading from that file
> > > > > >> >> and writing to tape, while the other is still spooling.
> > > > >> >
> > > > >> > We are missing a lot of details, here. I guess the first thing I'd try
> > > > > >> > would be bumping up the read_ahead_kb parameter, since I'm guessing
> > > > >> > that your backup application isn't driving very deep queue depths. If
> > > > >> > that doesn't work, then please provide exact invocations of tiobench
> > > > > >> > that reproduce the problem or some blktrace output for your real test.
> > > > >>
> > > > >> Any news, Ralf?
> > > > >
> > > > > > Sorry for the delay. At the moment large backups are running and using
> > > > > > the raid device for spooling, so I can't do any tests.
> > > > > >
> > > > > > Re. read ahead: I tested different settings from 8 KB to 65 KB; this
> > > > > > didn't help.
> > > > >
> > > > > I'll do some more tests when the backups are done (3-4 more days).
> > > >
> > > > The default is 128KB, I believe, so it's strange that you would test
> > > > smaller values. ;) I would try something along the lines of 1 or 2 MB.
> > > >
> > > > I'm CCing Fengguang in case he has any suggestions.
> > >
> > > Jeff, thank you for the forwarding (and sorry for the long delay)!
> > >
> > > The read:write (or rather sync:async) ratio control is an IO scheduler
> > > feature. CFQ has parameters slice_sync and slice_async for that.
> > > What's more, CFQ makes async IO wait while any sync IO is in flight.
> > > This is good, but not quite enough. Normally sync IOs arrive one by
> > > one, with a small idle window in between. If we only start
> > > dispatching async IOs once the last sync IO has been complete for,
> > > e.g., 1 ms, then we can hold back the async background write IOs
> > > whenever there is an active sync foreground read IO stream.
> > >
> > > This simple patch aims to address the writes-push-aside-reads problem.
> > > Ralf, you can try applying this patch and run your workload with this
> > > (huge) CFQ parameter:
> > >
> > > echo 1000 > /sys/block/sda/queue/iosched/slice_sync
> > >
> > > The patch is based on 2.6.30, but can be trivially backported if you
> > > want to use an older kernel.
> > >
> > > It may impact overall (sync+async) IO throughput when there are one or
> > > more ongoing sync IO streams, so it requires considerable benchmarking
> > > and adjustment.
> > >
> > > Thanks,
> > > Fengguang
> > > ---
> > >
> > > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> > > index a55a9bd..14011b7 100644
> > > --- a/block/cfq-iosched.c
> > > +++ b/block/cfq-iosched.c
> > > @@ -1064,7 +1064,6 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
> > > if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
> > > return;
> > >
> > > - WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
> > > WARN_ON(cfq_cfqq_slice_new(cfqq));
> > >
> > > /*
> > > @@ -2175,8 +2174,6 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
> > > * or if we want to idle in case it has no pending requests.
> > > */
> > > if (cfqd->active_queue == cfqq) {
> > > - const bool cfqq_empty = RB_EMPTY_ROOT(&cfqq->sort_list);
> > > -
> > > if (cfq_cfqq_slice_new(cfqq)) {
> > > cfq_set_prio_slice(cfqd, cfqq);
> > > cfq_clear_cfqq_slice_new(cfqq);
> > > @@ -2190,8 +2187,8 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
> > > */
> > > if (cfq_slice_used(cfqq) || cfq_class_idle(cfqq))
> > > cfq_slice_expired(cfqd, 1);
> > > - else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
> > > - sync && !rq_noidle(rq))
> > > + else if (sync && !rq_noidle(rq) &&
> > > + !cfq_close_cooperator(cfqd, cfqq, 1))
> > > cfq_arm_slice_timer(cfqd);
> > > }
> >
> > What's the purpose of this patch? If you have requests pending you don't
> > want to arm the idle timer and wait, you want to dispatch those.
>
> You are right, please ignore this mindless hacking patch.
>
> Ralf, you can do the read/write ratio in the CFQ scheduler by tuning
> the slice_sync/slice_async parameters.
>
> For example,
>
> echo 10 > /sys/block/sda/queue/iosched/slice_async
> echo 100 > /sys/block/sda/queue/iosched/slice_sync
>
> gives
>
> -dsk/total-
> read writ
> 66M 25M
> 65M 20M
> 49M 32M
> 84M 19M
> 46M 28M
> 61M 23M
> 55M 25M
> 67M 23M
> 76M 18M
> 46M 31M
> 56M 29M
> 54M 23M
> 76M 20M


writing:

--dsk/md1--
_read _writ
0 150M
0 142M
0 143M
0 112M
0 141M
0 152M
0 132M
0 123M
0 149M


reading:

--dsk/md1--
_read _writ
143M 0
145M 0
160M 0
128M 0
148M 0
140M 0
158M 0
130M 0
122M 0

reading + writing:

--dsk/md1--
_read _writ
55M 76M
41M 83M
64M 81M
64M 83M
63M 68M
56M 117M
41M 61M
64M 87M
64M 69M
61M 87M
67M 81M
64M 33M
63M 68M
56M 76M



> while
>
> echo 10 > /sys/block/sda/queue/iosched/slice_async
> echo 300 > /sys/block/sda/queue/iosched/slice_sync
>
> gives
>
> -dsk/total-
> read writ
> 102M 11M
> 82M 10M
> 100M 12M
> 86M 10M
> 95M 11M
> 102M 3168k
> 96M 11M
> 88M 10M
> 96M 12M
>
> However, too large a slice_sync value may not be desirable.

writing:

--dsk/md1--
_read _writ
0 131M
0 136M
0 145M
0 136M
0 128M
0 150M
0 127M
0 149M
0 127M
0 156M
0 125M
0 142M

reading:

--dsk/md1--
_read _writ
128M 0
160M 0
128M 0
128M 0
160M 0
128M 0
109M 0
128M 0
128M 0
160M 0
128M 0


writing:

--dsk/md1--
_read _writ
0 183M
0 142M
0 137M
0 147M
0 135M
0 147M
0 117M
0 135M
0 156M
0 120M
0 147M
0 135M

reading + writing:

--dsk/md1--
_read _writ
96M 40M
64M 38M
96M 29M
96M 24M
96M 31M
95M 35M
97M 26M
96M 23M
96M 33M
95M 73M
91M 25M
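
To compare runs like the ones above without eyeballing the columns, the
dstat samples can be averaged. This is only a sketch: the helper name
dstat_avg is made up for illustration, it assumes dstat's k/M/G suffixes
are 1024-based, and it expects the plain two-column read/write output
(header lines are skipped, quoted "> " lines are not handled).

```shell
#!/bin/sh
# dstat_avg (hypothetical helper): read two-column dstat disk output on
# stdin and print the average read and write values in MB.
dstat_avg() {
    awk '
    # Convert a dstat value such as 96M, 3168k or 0 to MB.
    # Assumption: suffixes are 1024-based; a bare number is bytes.
    function to_mb(v,    n, u) {
        u = substr(v, length(v))   # last character = unit suffix
        n = v + 0                  # numeric prefix of the value
        if (u == "k") return n / 1024
        if (u == "M") return n
        if (u == "G") return n * 1024
        return n / 1048576
    }
    # Only lines whose two fields both start with a digit are samples.
    $1 ~ /^[0-9]/ && $2 ~ /^[0-9]/ {
        r += to_mb($1); w += to_mb($2); n++
    }
    END {
        if (n) printf "read %.1fM write %.1fM samples %d\n", r / n, w / n, n
    }
    '
}
```

Pasting one of the "reading + writing" blocks into a file and running
`dstat_avg < file` gives one summary line per run, which makes the effect
of different slice_sync values easier to compare.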


Thanks, this seems to be what I was looking for. I'll change the scheduler
parameters for all spool devices and then run a test with two concurrent
backup jobs. This will show me whether bacula behaves the same as the simple
dd test does.
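
Applying the settings to several devices could be scripted roughly as
follows. This is a sketch, not a tested setup: set_cfq_slices and the
SYSFS_ROOT override are names invented here, and the device names in the
usage note are placeholders.

```shell
#!/bin/sh
# Sketch: apply CFQ slice settings to one block device.
# set_cfq_slices and SYSFS_ROOT are hypothetical names, not part of any tool.
set_cfq_slices() {
    cfq_dev="$1"        # device name, e.g. sda
    cfq_async_ms="$2"   # value written to slice_async
    cfq_sync_ms="$3"    # value written to slice_sync
    cfq_dir="${SYSFS_ROOT:-/sys}/block/$cfq_dev/queue/iosched"

    # The slice_* files only exist while CFQ is the active scheduler,
    # so refuse to continue if they are missing or not writable.
    if [ ! -w "$cfq_dir/slice_async" ] || [ ! -w "$cfq_dir/slice_sync" ]; then
        echo "$cfq_dev: no writable CFQ slice tunables under $cfq_dir" >&2
        return 1
    fi

    echo "$cfq_async_ms" > "$cfq_dir/slice_async" &&
    echo "$cfq_sync_ms" > "$cfq_dir/slice_sync"
}
```

Run as root, something like `for d in sda sdb; do set_cfq_slices "$d" 10 300; done`
would apply the second example above to each device. sysfs values do not
survive a reboot, so the call would have to be repeated from a boot script.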


Ralf