Re: Buffered I/O slowness

From: Jesse Barnes
Date: Fri Oct 29 2004 - 12:52:05 EST


On Monday, October 25, 2004 6:14 pm, Jesse Barnes wrote:
> I've been doing some simple disk I/O benchmarking with an eye towards
> improving large, striped volume bandwidth. I ran some tests on individual
> disks and filesystems to establish a baseline and found that things
> generally scale quite well:
>
> o one thread/disk using O_DIRECT on the block device
> read avg: 2784.81 MB/s
> write avg: 2585.60 MB/s
>
> o one thread/disk using O_DIRECT + filesystem
> read avg: 2635.98 MB/s
> write avg: 2573.39 MB/s
>
> o one thread/disk using buffered I/O + filesystem
> read w/default (128) block/*/queue/read_ahead_kb avg: 2626.25 MB/s
> read w/max (4096) block/*/queue/read_ahead_kb avg: 2652.62 MB/s
> write avg: 1394.99 MB/s
>
> Configuration:
> o 8p sn2 ia64 box
> o 8GB memory
> o 58 disks across 16 controllers
> (4 disks for 10 of them and 3 for the other 6)
> o aggregate I/O bw available is about 2.8GB/s
>
> Test:
> o one I/O thread per disk, round-robined across the 8 CPUs
> o each thread did ~450MB of I/O depending on the test (ran for 10s)
> Note: the total was > 8GB so in the buffered read case not everything
> could be cached
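
For reference, each per-disk thread in those direct I/O tests does
something like the loop below. This is just a minimal sketch, not the
actual test code; the 1MB chunk size and the 4096-byte alignment are
assumptions (O_DIRECT requires aligned buffers, offsets, and sizes).

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK	(1024 * 1024)	/* assumed: 1MB per read() */

static int direct_read_loop(const char *dev, size_t total)
{
	void *buf;
	size_t done = 0;
	int fd = open(dev, O_RDONLY | O_DIRECT);

	if (fd < 0)
		return -1;
	/* O_DIRECT buffers must be aligned; 4096 is safe here */
	if (posix_memalign(&buf, 4096, CHUNK)) {
		close(fd);
		return -1;
	}
	while (done < total) {
		ssize_t n = read(fd, buf, CHUNK);
		if (n <= 0)
			break;
		done += n;
	}
	free(buf);
	close(fd);
	return 0;
}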

More results here. I've run some tests on a large dm striped volume
formatted with XFS. It had 64 disks with a 64k stripe unit (XFS was made
aware of this at format time), and I explicitly set the readahead to
524288 sectors using blockdev. The results aren't as bad as my previous
runs, but I think they're still much slower than they ought to be, given
the direct I/O results above. This was after a fresh mount, so the
pagecache was empty when I started the tests.
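
Setting the readahead programmatically looks roughly like this;
blockdev --setra issues the same BLKRASET ioctl, and the unit is
512-byte sectors, so 524288 works out to 256MB:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* BLKRASET */

/* equivalent of `blockdev --setra sectors dev` */
static int set_readahead(const char *dev, unsigned long sectors)
{
	int ret;
	int fd = open(dev, O_RDONLY);

	if (fd < 0)
		return -1;
	ret = ioctl(fd, BLKRASET, sectors);
	close(fd);
	return ret;
}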

o one thread on one large volume using buffered I/O + filesystem
read (1 thread, one volume, 131072 blocks/request) avg: ~931 MB/s
write (1 thread, one volume, 131072 blocks/request) avg: ~908 MB/s

I'm intentionally issuing very large reads and writes here to take
advantage of the striping, but it looks like both the readahead and the
regular buffered I/O code split the I/O into page-sized chunks? The call
chain is pretty long, but it looks to me like do_generic_mapping_read()
splits the reads up by page and issues them independently to the lower
levels. In the direct I/O case, up to 64 pages are issued at a time,
which seems like it would help throughput quite a bit. The profile
(appended below) seems to confirm this. Unfortunately I didn't save the
vmstat output for this run (and now the fc switch is misbehaving, so I
have to fix that before I run again), but IIRC the system time was
pretty high given that only one thread was issuing I/O.
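
To make the page-at-a-time behavior concrete, the read side is shaped
roughly like this (a hand-simplified paraphrase of the 2.6
do_generic_mapping_read() loop, not compilable code):

/*
 * Simplified shape of do_generic_mapping_read(): every iteration
 * handles exactly one page, no matter how large the original
 * read() was.
 */
for (;;) {
	struct page *page;

	if (index > end_index)
		break;
	/* readahead state also advances a page at a time */
	page_cache_readahead(mapping, &ra, filp, index);
	page = find_get_page(mapping, index);
	if (!page || !PageUptodate(page))
		/* fall back to a synchronous one-page read */
		mapping->a_ops->readpage(filp, page);
	/* copy this page up to userspace, then move on */
	file_read_actor(desc, page, offset, nr);
	index++;
}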

So maybe a few things need to be done:
o set readahead to larger values by default for dm volumes at setup time
(the default was very small)
o maybe bypass readahead for very large requests? (see the sketch after
this list)
if a process is doing huge requests, chances are that readahead won't
benefit it as much as it does a process doing small requests
o writes: not sure yet, I haven't looked at that call chain much
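
Something like this is what I have in mind for the bypass; it's purely
hypothetical (the threshold, the name, and the hook point are all made
up):

/*
 * Hypothetical heuristic, not real kernel code: skip the readahead
 * state machine when a single request already spans many pages and
 * let the caller build large multi-page I/Os directly.
 */
#define RA_BYPASS_PAGES	64	/* made-up threshold: 256KB with 4K pages */

static int should_bypass_readahead(unsigned long req_pages)
{
	return req_pages >= RA_BYPASS_PAGES;
}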

Does any of this sound reasonable at all? What else could be done to make the
buffered I/O layer friendlier to large requests?

Thanks,
Jesse
Profile from the buffered read run:

115383 total                        0.0203
 49642 ia64_pal_call_static       258.5521
 42065 default_idle                93.8951
  7348 __copy_user                  3.1030
  5865 ia64_save_scratch_fpregs    91.6406
  5766 ia64_load_scratch_fpregs    90.0938
  1944 _spin_unlock_irq            30.3750
   352 _spin_unlock_irqrestore      3.6667
   231 buffered_rmqueue             0.1245
   225 kmem_cache_free              0.5859
   151 mpage_end_io_read            0.2776
   147 __end_that_request_first     0.1242
   133 bio_alloc                    0.0967
   122 smp_call_function            0.1089
   102 shrink_list                  0.0224
    99 unlock_page                  0.4420
    86 free_hot_cold_page           0.0840
    82 kmem_cache_alloc             0.3203
    65 __alloc_pages                0.0271
    53 do_mpage_readpage            0.0224
    53 bio_clone                    0.1380
    49 __might_sleep                0.0806
    44 mpage_readpages              0.0598
    43 generic_make_request         0.0345
    42 sn_pci_unmap_sg              0.1010
    42 sn_dma_flush                 0.0597
    41 clear_page                   0.2562
    40 file_read_actor              0.0431
    34 mark_page_accessed           0.0966
    32 __bio_add_page               0.0278