Re: Reduce latencies for synchronous writes and high I/O priority requests in deadline IO scheduler

From: Corrado Zoccolo
Date: Sun Apr 26 2009 - 08:43:56 EST


Hi Jens,
I found fio, a very handy and complete tool for block I/O performance
testing (kudos to the author), and started doing some thorough testing
of the patch, since I couldn't tune tiotest's behaviour and had only a
superficial understanding of what was going on.
The test configuration is attached for reference. Each test is run
after dropping the caches. The suffix .2 or .3 gives the value of the
{writes,async}_starved tunable.

My findings are interesting:
* there is a definite improvement for many readers performing random
reads alongside one sequential writer (I think this is the case tiotest
was showing, due to the unclear separation, i.e. no fsync, between
tiotest phases). This workload simulates boot on a single-disk machine,
with random reads that represent fault-ins for binaries and libraries,
and sequential writes that represent log updates.

* the improvement is not present if the number of readers is small
(e.g. 4). Performance is then similar to the original deadline, which
is far below cfq. The problem appears to be caused by the unfairness
towards low-numbered sectors, and happens only when the random readers
have overlapping reading regions. Let's assume 4 readers, as in my test.
The workload evolves as follows: a read batch is started from the
request that is first in the FIFO. The probability that the batch
starts at the first read in disk order is 1/4, and the probability that
all the initial reads are serviced by the second batch is 7/24
(assuming the first reader doesn't post a new request yet). This means
there is an 11/24 probability that we need more than 2 batches to
service all the initial read requests (and only then do we service the
starved writer: increasing writes_starved in fact improves the reader
bandwidth). If the reading regions overlap, the FIFO will still be
randomly ordered after the writer is serviced, so the same pattern
repeats. A toy simulation of this multi-batch effect is sketched after
this list item.
A perfect scheduler, instead, for each batch in which fewer than
fifo_batch requests are available, would schedule all the available
read requests, i.e. start from the first in disk order instead of the
first in FIFO order (unless a deadline has expired).
This would allow the readers to progress much faster. Do you want me to
test such a heuristic?
** I think there is also another theoretical bad case in deadline's
behaviour, i.e. when all deadlines expire. In that case, it switches to
pure FIFO batch scheduling. Here, too, scheduling all requests in disk
order would allow a faster recovery. Do you think we should handle this
case as well?
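
Below is the toy simulation mentioned above. It is only a minimal model
of the batching pattern described in this item (random readers, batches
that start from the FIFO-oldest request and then proceed in ascending
sector order): it ignores newly posted requests and the fifo_batch
limit, so the fractions it prints will not match the 7/24 and 11/24
estimates exactly, but it shows how often more than one batch is needed
before all the initial reads are serviced.

/*
 * Toy model: NREADERS pending random reads, serviced by deadline-style
 * batches that start at the FIFO-oldest request and then proceed in
 * ascending sector order.  Requests at lower sector numbers are left
 * behind for a later batch.  Prints the distribution of the number of
 * batches needed to drain all the initial reads.
 */
#include <stdio.h>
#include <stdlib.h>

#define NREADERS 4
#define RUNS     1000000

int main(void)
{
    int hist[NREADERS + 1];

    for (int i = 0; i <= NREADERS; i++)
        hist[i] = 0;

    srand(12345);

    for (int run = 0; run < RUNS; run++) {
        int sector[NREADERS];  /* disk-order rank of the request in FIFO slot i */
        int pending[NREADERS];

        /* random permutation: FIFO position -> position in disk order */
        for (int i = 0; i < NREADERS; i++) {
            sector[i] = i;
            pending[i] = 1;
        }
        for (int i = NREADERS - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            int tmp = sector[i]; sector[i] = sector[j]; sector[j] = tmp;
        }

        int left = NREADERS, batches = 0;

        while (left > 0) {
            int start = 0;

            /* the batch starts from the FIFO-oldest pending request... */
            while (!pending[start])
                start++;

            /* ...and services pending requests in ascending sector order
             * from there; lower-numbered sectors must wait */
            for (int i = 0; i < NREADERS; i++) {
                if (pending[i] && sector[i] >= sector[start]) {
                    pending[i] = 0;
                    left--;
                }
            }
            batches++;
        }
        hist[batches]++;
    }

    for (int b = 1; b <= NREADERS; b++)
        printf("%d batch(es): %.3f\n", b, (double)hist[b] / RUNS);

    return 0;
}

Starting each batch from the lowest pending sector instead (the
heuristic proposed above) trivially drains all the initial reads in a
single batch in this model.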

* I think that, now that we differentiate between sync and async
writes, we can painlessly increase the async_starved tunable. This will
provide better performance for mixed workloads such as random readers
mixed with a sequential writer. In particular, the 32 readers / 1 writer
test shows impressive performance: full write bandwidth is achieved,
while reader bandwidth outperforms all other schedulers, including cfq
(which instead completely starves the writer).
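
To make the async_starved idea concrete, here is a userspace caricature
of the dispatch-direction choice (this is not the patch; the names
choose_dir, starved_writes and starved_async, and the tunable values,
are invented for illustration). It mirrors the spirit of deadline's
existing writes_starved check: reads are preferred, but one counter
bounds how many read batches may pass over queued writes, and a second
counter bounds how many sync batches may pass over queued async writes.

#include <stdio.h>
#include <stdbool.h>

enum dir { DIR_READ, DIR_SYNC_WRITE, DIR_ASYNC_WRITE };

struct sched {
    int writes_starved;  /* read batches allowed to pass over queued writes   */
    int async_starved;   /* sync batches allowed to pass over queued async IO */
    int starved_writes;
    int starved_async;
};

/*
 * Pick the direction of the next batch.  The caller guarantees that at
 * least one of the three queues is non-empty.
 */
static enum dir choose_dir(struct sched *s, bool reads, bool sync_w, bool async_w)
{
    if (reads) {
        if ((sync_w || async_w) && s->starved_writes++ >= s->writes_starved)
            goto writes;
        return DIR_READ;
    }
writes:
    s->starved_writes = 0;
    if (sync_w) {
        if (async_w && s->starved_async++ >= s->async_starved)
            goto async;
        return DIR_SYNC_WRITE;
    }
async:
    s->starved_async = 0;
    return DIR_ASYNC_WRITE;
}

int main(void)
{
    struct sched s = { .writes_starved = 2, .async_starved = 3,
                       .starved_writes = 0, .starved_async = 0 };
    int count[3] = { 0, 0, 0 };

    /* keep all three queues permanently backlogged and count the batches */
    for (int i = 0; i < 1200; i++)
        count[choose_dir(&s, true, true, true)]++;

    printf("read batches %d, sync write batches %d, async write batches %d\n",
           count[DIR_READ], count[DIR_SYNC_WRITE], count[DIR_ASYNC_WRITE]);
    return 0;
}

With everything backlogged, reads get most of the batches, sync writes
a smaller share and async writes the smallest; raising async_starved
shifts service further towards reads and sync writes without ever
starving async writeback completely.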

* on my machine, there is a regression on sequential writes (2 parallel
sequential writers, instead, give better performance, and 1 sequential
writer mixed with many random readers maxes out the write bandwidth).
Interestingly, this regression disappears when I spread some printks
around. It is therefore a timing issue that causes fewer merges to
happen (I think this can be fixed by allowing async writes to be
scheduled only after an initial delay):
# run with printks #

seqwrite: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=psync, iodepth=1
Starting 1 process
Jobs: 1 (f=1): [F] [100.0% done] [ 0/ 0 kb/s] [eta 00m:00s]
seqwrite: (groupid=0, jobs=1): err= 0: pid=4838
write: io=1010MiB, bw=30967KiB/s, iops=7560, runt= 34193msec
clat (usec): min=7, max=4274K, avg=114.53, stdev=13822.22
cpu : usr=1.44%, sys=9.47%, ctx=1299, majf=0, minf=154
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w: total=0/258510, short=0/0
lat (usec): 10=45.00%, 20=52.36%, 50=2.18%, 100=0.05%, 250=0.32%
lat (usec): 500=0.03%, 750=0.01%, 1000=0.01%
lat (msec): 2=0.01%, 4=0.01%, 10=0.01%, 100=0.01%, 250=0.02%
lat (msec): 500=0.01%, 2000=0.01%, >=2000=0.01%

Run status group 0 (all jobs):
WRITE: io=1010MiB, aggrb=30967KiB/s, minb=30967KiB/s,
maxb=30967KiB/s, mint=34193msec, maxt=34193msec

Disk stats (read/write):
sda: ios=35/8113, merge=0/250415, ticks=2619/4277418,
in_queue=4280032, util=96.52%

# run without printks #
seqwrite: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=psync, iodepth=1
Starting 1 process
Jobs: 1 (f=1): [F] [100.0% done] [ 0/ 0 kb/s] [eta 00m:00s]
seqwrite: (groupid=0, jobs=1): err= 0: pid=5311
write: io=897076KiB, bw=26726KiB/s, iops=6524, runt= 34371msec
clat (usec): min=7, max=1801K, avg=132.11, stdev=6407.61
cpu : usr=1.14%, sys=7.84%, ctx=1272, majf=0, minf=318
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w: total=0/224269, short=0/0
lat (usec): 10=49.04%, 20=49.05%, 50=1.17%, 100=0.07%, 250=0.51%
lat (usec): 500=0.02%, 750=0.01%, 1000=0.01%
lat (msec): 2=0.01%, 4=0.02%, 10=0.01%, 20=0.01%, 50=0.01%
lat (msec): 100=0.05%, 250=0.03%, 500=0.01%, 2000=0.01%

Run status group 0 (all jobs):
WRITE: io=897076KiB, aggrb=26726KiB/s, minb=26726KiB/s,
maxb=26726KiB/s, mint=34371msec, maxt=34371msec

Disk stats (read/write):
sda: ios=218/7041, merge=0/217243, ticks=16638/4254061,
in_queue=4270696, util=98.92%

Corrado

--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:czoccolo@xxxxxxxxx
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------

Attachment: test2.fio
Description: Binary data

Attachment: deadline-iosched-orig.2
Description: Binary data

Attachment: deadline-iosched-patched.2
Description: Binary data

Attachment: deadline-iosched-orig.3
Description: Binary data

Attachment: deadline-iosched-patched.3
Description: Binary data

Attachment: cfq
Description: Binary data