Re: Random file I/O regressions in 2.6

From: Nick Piggin
Date: Mon May 03 2004 - 06:21:17 EST


Alexey Kopytov wrote:
Hello!

I tried to compare random file I/O performance on the 2.4 and 2.6 kernels and found some regressions that I failed to explain. I tested 2.4.25, 2.6.5-bk2 and 2.6.6-rc3 with my own utility, SysBench, which was written to generate workloads similar to those of a database under intensive load.

For the 2.6.x kernels the anticipatory, deadline, CFQ and noop I/O schedulers were
tested, with AS giving the best results for this workload; even so, it is still about 1.5 times slower than 2.4.25.

The SysBench 'fileio' test was configured to generate the following workload:
16 worker threads are created, each issuing random read/write file requests in
blocks of 16 KB with a read/write ratio of 1.5. All I/O operations are evenly
distributed over 128 files with a total size of 3 GB. Every 100 requests, an
fsync() operation is performed sequentially on each file. The total number of
requests is limited to 10000.

The FS used for the test was ext3 with data=ordered.
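
In plain C terms, the workload boils down to roughly the following sketch (an illustration only, not SysBench's actual code; it assumes the 128 test files test_file.0 .. test_file.127 have already been created with a total size of 3 GB):

/*
 * Rough sketch of the workload described above -- not SysBench itself.
 * Build with: gcc -O2 -pthread -o fileio-sketch fileio-sketch.c
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NR_THREADS	16
#define NR_FILES	128
#define FILE_SIZE	((3ULL << 30) / NR_FILES)	/* 3 GB total */
#define BLOCK_SIZE	(16 * 1024)			/* 16 KB requests */
#define NR_REQUESTS	10000

static int fds[NR_FILES];
static long requests_done;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
	char buf[BLOCK_SIZE];

	(void)arg;
	memset(buf, 0x5a, sizeof(buf));

	for (;;) {
		long n;
		int i, f;
		off_t off;

		pthread_mutex_lock(&lock);
		n = ++requests_done;
		pthread_mutex_unlock(&lock);
		if (n > NR_REQUESTS)
			break;

		/* every 100 requests, fsync() each file in turn */
		if (n % 100 == 0)
			for (i = 0; i < NR_FILES; i++)
				fsync(fds[i]);

		/* pick a random 16 KB-aligned offset in a random file */
		f = rand() % NR_FILES;
		off = (off_t)(rand() % (int)(FILE_SIZE / BLOCK_SIZE)) * BLOCK_SIZE;

		/* read/write ratio of 1.5, i.e. 3 reads for every 2 writes */
		if (rand() % 5 < 3)
			pread(fds[f], buf, BLOCK_SIZE, off);
		else
			pwrite(fds[f], buf, BLOCK_SIZE, off);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NR_THREADS];
	char name[64];
	int i;

	for (i = 0; i < NR_FILES; i++) {
		snprintf(name, sizeof(name), "test_file.%d", i);
		fds[i] = open(name, O_RDWR);
		if (fds[i] < 0) {
			perror(name);
			return 1;
		}
	}
	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}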


I am able to reproduce this here. 2.6 isn't improved by increasing
nr_requests, relaxing IO scheduler deadlines, or turning off readahead.
It looks like 2.6 is submitting a lot of the IO in 4KB sized requests...

Hmm, oh dear. It looks like the readahead logic shat itself and/or
do_generic_mapping_read doesn't know how to handle multipage reads
properly.

What ends up happening is that readahead gets turned off, and the 16K read
is then done as 4 synchronous 4K chunks. Because they are synchronous, they
have no chance of being merged with one another either.
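
From the I/O scheduler's point of view it is then as if the application itself had split its reads; a userspace analogy of the pattern the kernel ends up generating (not kernel code, and fd/off are just placeholders):

#include <sys/types.h>
#include <unistd.h>

/*
 * What each 16K read effectively turns into once readahead is off:
 * four back-to-back synchronous 4K reads.  Each one completes before
 * the next is issued, so the block layer never sees them together and
 * cannot merge them into a single larger request.
 */
static void read_16k_as_4k_chunks(int fd, off_t off)
{
	char buf[4096];
	int i;

	for (i = 0; i < 4; i++)
		if (pread(fd, buf, sizeof(buf), off + (off_t)i * 4096) < 0)
			break;
}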

I have attached a proof of concept hack... I think what should really
happen is that page_cache_readahead should be taught about the size
of the requested read, and ensure that a decent amount of reading is
done within the read request window, even if
beyond-request-window-readahead has been previously unsuccessful.
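
Something along these lines, as a sketch of the idea rather than a tested interface change (the helper and its 'req_size' argument are hypothetical; force_page_cache_readahead() and max_sane_readahead() are the existing helpers the hack below uses):

#include <linux/fs.h>
#include <linux/mm.h>

/*
 * Sketch only: before the read path falls back to reading one page at a
 * time, make sure the pages covering the current read() request are at
 * least submitted together.  'req_size' is the request length in pages
 * and would have to be passed down from do_generic_mapping_read().
 */
static void readahead_cover_request(struct address_space *mapping,
				    struct file *filp,
				    unsigned long index,
				    unsigned long req_size)
{
	if (req_size > 1)
		force_page_cache_readahead(mapping, filp, index,
					   max_sane_readahead(req_size));
}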

Numbers with an IDE disk, 256MB RAM:
2.4.24: 81s
2.6.6-rc3-mm1: 126s
rc3-mm1+patch: 87s

The small remaining regression might be explained by 2.6's smaller default
nr_requests, the IDE driver, I/O scheduler tuning, etc.

Here are the results (values are the number of seconds taken to complete the test):

2.4.25: 77.5377

2.6.5-bk2(noop): 165.3393
2.6.5-bk2(anticipatory): 118.7450
2.6.5-bk2(deadline): 130.3254
2.6.5-bk2(CFQ): 146.4286

2.6.6-rc3(noop): 164.9486
2.6.6-rc3(anticipatory): 125.1776
2.6.6-rc3(deadline): 131.8903
2.6.6-rc3(CFQ): 152.9280

I have published the results as well as the hardware and kernel setups at the
SysBench home page: http://sysbench.sourceforge.net/results/fileio/

Any comments or suggestions would be highly appreciated.


From your website:
"Another interesting fact is that AS gives the best results for this
workload, though it's believed to give worse results for this kind of
workloads as compared to other I/O schedulers available in 2.6.x
kernels."

The anticipatory scheduler is actually in a fairly good state of tune,
and can often beat deadline even for random read/write/fsync tests. The
infamous database regression problem is when this sort of workload is
combined with TCQ disk drives.

Nick

 include/linux/mm.h             |    0
 linux-2.6-npiggin/mm/filemap.c |    5 ++++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff -puN mm/readahead.c~read-populate mm/readahead.c
diff -puN mm/filemap.c~read-populate mm/filemap.c
--- linux-2.6/mm/filemap.c~read-populate 2004-05-03 19:56:00.000000000 +1000
+++ linux-2.6-npiggin/mm/filemap.c 2004-05-03 20:51:37.000000000 +1000
@@ -627,6 +627,9 @@ void do_generic_mapping_read(struct addr
index = *ppos >> PAGE_CACHE_SHIFT;
offset = *ppos & ~PAGE_CACHE_MASK;

+ force_page_cache_readahead(mapping, filp, index,
+ max_sane_readahead(desc->count >> PAGE_CACHE_SHIFT));
+
for (;;) {
struct page *page;
unsigned long end_index, nr, ret;
@@ -644,7 +647,7 @@ void do_generic_mapping_read(struct addr
}

cond_resched();
- page_cache_readahead(mapping, ra, filp, index);
+ page_cache_readahead(mapping, ra, filp, index + desc->count);

nr = nr - offset;
find_page:
diff -puN include/linux/mm.h~read-populate include/linux/mm.h

_