Re: designing fast, streaming disk i/o with mmap: help wanted

From: Paul Barton-Davis (pbd@Op.Net)
Date: Sun Apr 02 2000 - 18:09:54 EST


OK, the design story continues.

Here are some numbers. Each line has this format:

<file-id>: lock <num-bytes> (<start-byte> .. <end-byte>) = <msecs>

Each <file-id> has been preallocated at 23040044 bytes, and is mmapped
into the process address space.

The lines show the amount of time that mlock(2) took to return when
asked to lock the specified bytes into memory. Note that munlock(2) is
in use as well, so the total amount of memory locked at any one time
stays more or less constant.
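
For concreteness, each timed line is produced by something roughly
like this (a simplified sketch, not the actual code; error handling is
omitted, and the real code does its locking and unlocking from a
dedicated thread):

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <sys/time.h>
  #include <unistd.h>

  #define FILE_SIZE 23040044UL   /* every file is preallocated at this size */

  /* time how long mlock(2) takes to pull [start .. start+len) of an
     mmapped file into memory, and print it in the format shown above */
  static void
  timed_lock (int file_id, char *base, size_t start, size_t len)
  {
          struct timeval t1, t2;
          double ms;

          gettimeofday (&t1, 0);
          mlock (base + start, len); /* faults the pages in and pins them */
          gettimeofday (&t2, 0);

          ms = (t2.tv_sec - t1.tv_sec) * 1000.0
                  + (t2.tv_usec - t1.tv_usec) / 1000.0;

          printf ("%d: lock %lu (%lu .. %lu) = %.2fmsecs\n",
                  file_id, (unsigned long) len, (unsigned long) start,
                  (unsigned long) (start + len), ms);
  }

  int
  main (int argc, char **argv)
  {
          int fd = open (argv[1], O_RDWR);
          char *base = mmap (0, FILE_SIZE, PROT_READ|PROT_WRITE,
                             MAP_SHARED, fd, 0);

          timed_lock (0, base, 1015852, 65536);

          munlock (base + 1015852, 65536);
          munmap (base, FILE_SIZE);
          close (fd);
          return 0;
  }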

For the most part, things look good. The performance is pretty much
exactly what I would expect from other tests of my disk
performance. The basic number to spot is roughly 16ms to mlock()
(i.e. seek and read into the buffer cache) 64K of data, which works
out to about 4MB/sec. Sometimes it's a lot less (0.1msec or less) if
the data is already there thanks to read-ahead or fortuitous
hang-overs from prior runs. Here is an example:

0: lock 65536 (1015852 .. 1081388) = 12.94msecs
1: lock 65536 (1015852 .. 1081388) = 14.64msecs
2: lock 65536 (1015852 .. 1081388) = 18.80msecs
3: lock 65536 (1015852 .. 1081388) = 10.40msecs
4: lock 65536 (1015852 .. 1081388) = 10.67msecs
5: lock 65536 (1015852 .. 1081388) = 11.78msecs
6: lock 65536 (1015852 .. 1081388) = 17.26msecs
7: lock 65536 (1015852 .. 1081388) = 16.70msecs
8: lock 65536 (1015852 .. 1081388) = 10.41msecs
9: lock 65536 (1015852 .. 1081388) = 10.21msecs
10: lock 65536 (1015852 .. 1081388) = 11.94msecs
11: lock 65536 (1015852 .. 1081388) = 10.56msecs
12: lock 65536 (1015852 .. 1081388) = 18.21msecs
13: lock 65536 (1048620 .. 1114156) = 10.42msecs
14: lock 65536 (1048620 .. 1114156) = 13.99msecs
15: lock 65536 (1048620 .. 1114156) = 0.10msecs
16: lock 65536 (1048620 .. 1114156) = 0.09msecs
17: lock 65536 (1048620 .. 1114156) = 0.09msecs
18: lock 65536 (1048620 .. 1114156) = 0.09msecs
19: lock 65536 (1048620 .. 1114156) = 0.10msecs
20: lock 65536 (1048620 .. 1114156) = 0.08msecs
21: lock 65536 (1048620 .. 1114156) = 0.08msecs
22: lock 65536 (1048620 .. 1114156) = 0.09msecs
23: lock 65536 (1048620 .. 1114156) = 0.08msecs

This is just fine. But as we move a lot further into the file, and
really start doing full-time disk I/O, I start to see this:

0: lock 65536 (4948012 .. 5013548) = 13.33msecs
1: lock 65536 (4948012 .. 5013548) = 16.13msecs
2: lock 65536 (4948012 .. 5013548) = 14.79msecs
3: lock 65536 (4948012 .. 5013548) = 73.73msecs <<<< !!!
4: lock 98304 (4948012 .. 5046316) = 58.68msecs <<<< !!!
5: lock 98304 (4948012 .. 5046316) = 18.43msecs
6: lock 98304 (4948012 .. 5046316) = 17.48msecs
7: lock 98304 (4948012 .. 5046316) = 16.70msecs
8: lock 65536 (4980780 .. 5046316) = 10.31msecs
9: lock 65536 (4980780 .. 5046316) = 12.73msecs
10: lock 65536 (4980780 .. 5046316) = 18.65msecs
11: lock 65536 (4980780 .. 5046316) = 10.23msecs
12: lock 98304 (4980780 .. 5079084) = 10.06msecs
13: lock 98304 (4980780 .. 5079084) = 14.21msecs
14: lock 98304 (4980780 .. 5079084) = 12.60msecs
15: lock 98304 (4980780 .. 5079084) = 13.74msecs
16: lock 98304 (4980780 .. 5079084) = 12.91msecs
17: lock 98304 (4980780 .. 5079084) = 12.30msecs
18: lock 98304 (4980780 .. 5079084) = 12.43msecs
19: lock 98304 (4980780 .. 5079084) = 28.62msecs
20: lock 65536 (5013548 .. 5079084) = 12.53msecs
21: lock 65536 (5013548 .. 5079084) = 12.35msecs
22: lock 65536 (5013548 .. 5079084) = 11.75msecs
23: lock 65536 (5013548 .. 5079084) = 11.13msecs

All of a sudden, some files are taking up to 5 times as long to read
the data.

So, now a word about these files. They were created on an otherwise
empty, fresh, squeaky clean ext2 partition, #2 on the following disk
(c/o fdisk):

  Disk /dev/sdb: 255 heads, 63 sectors, 554 cylinders
  Units = cylinders of 16065 * 512 bytes

     Device Boot    Start       End    Blocks   Id  System
  /dev/sdb1             1       319   2562336   83  Linux
  /dev/sdb2           320       415    771120   83  Linux
  /dev/sdb3           416       511    771120   83  Linux
  /dev/sdb4           512       554    345397+  83  Linux
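
(That is, /dev/sdb2 is 96 cylinders of 16065 * 512 bytes, i.e. roughly
790MB, starting about 2.6GB in from the start of the disk.)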

My assumptions were:

   1) each file is contiguous (they were created serially)
   2) the partition is as physically contiguous as it could be

I therefore anticipated that a serial read through all the files would
give rise to roughly uniform seek+read times. Instead, we see
variations of up to 5-fold.
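
One way to sanity-check assumption (1) would be to walk each file with
the FIBMAP ioctl (root only) and look for gaps in the physical block
allocation. A rough sketch, with no error handling:

  #include <fcntl.h>
  #include <linux/fs.h>        /* FIBMAP, FIGETBSZ */
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <sys/stat.h>
  #include <unistd.h>

  /* map each logical block of the file to its physical block number
     and report any place where the allocation is not contiguous */
  int
  main (int argc, char **argv)
  {
          int fd = open (argv[1], O_RDONLY);
          struct stat st;
          int bsz, nblocks, i, prev = -1;

          fstat (fd, &st);
          ioctl (fd, FIGETBSZ, &bsz);          /* filesystem block size */
          nblocks = (st.st_size + bsz - 1) / bsz;

          for (i = 0; i < nblocks; i++) {
                  int blk = i;                 /* in: logical, out: physical */
                  ioctl (fd, FIBMAP, &blk);
                  if (prev >= 0 && blk != prev + 1)
                          printf ("gap at logical block %d: %d -> %d\n",
                                  i, prev, blk);
                  prev = blk;
          }
          close (fd);
          return 0;
  }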

So, my question:

    * why is the mlock/seek+read time sometimes so much larger? what
      kinds of disk geometry issues or kernel buffer cache issues
      could give rise to this kind of behaviour? is there anything
      I can do about it?

Needless to say, this sucks. The thread doing the reading has the task
of keeping the far edge of the locked region a certain distance ahead
of another continuously moving point that is updated by a second
thread. If one mlock() call takes a performance hit, later calls end
up having to cover a longer segment of the file, and thus get
progressively slower, until eventually the "read ahead" falls behind
the "reader" and catastrophe ensues. This happens regardless of how
much read-ahead I use: once things start going downhill, they just
keep sliding, and more read-ahead just means it takes longer to hit
bottom, so to speak.
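
For reference, the read-ahead side of the scheme is, in outline,
something like this (much simplified; names and constants are
illustrative, and the coordination between the two threads is left
out):

  #include <stddef.h>
  #include <sys/mman.h>

  #define CHUNK     65536             /* lock granularity             */
  #define READAHEAD (4 * CHUNK)       /* how far ahead to stay locked */

  static size_t locked_start = 0;     /* low edge of the locked region */
  static size_t locked_end   = 0;     /* far edge of the locked region */

  /* called repeatedly by the read-ahead thread; 'playhead' is the
     continuously moving point that the other thread keeps advancing */
  void
  advance_window (char *base, size_t playhead)
  {
          size_t target = playhead + READAHEAD;

          /* lock new chunks until the far edge is far enough ahead;
             if one mlock() here is slow, the playhead keeps moving,
             so the next call has a longer stretch to cover */
          while (locked_end < target) {
                  mlock (base + locked_end, CHUNK);
                  locked_end += CHUNK;
          }

          /* unlock what the playhead has already passed, so the total
             locked footprint stays roughly constant */
          while (locked_start + CHUNK <= playhead) {
                  munlock (base + locked_start, CHUNK);
                  locked_start += CHUNK;
          }
  }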

Finally, where do I read about rawio? :)

--p



