(disk/cpu) kernel performance problem

Gerald Aigner (aigner@stanford.edu)
Mon, 26 Jul 1999 11:10:28 -0700 (PDT)


Hi!

I am currently in the process of optimizing a CPU and disk intensive
database program. My test setup is a 400 MHz Intel system running
the linux kernel 2.2.7 with IBM 7200rpm IDE disks (DMA enabled)
attached to it.

While optimizing the program I discovered that disk I/O performance in
Linux could be improved substantially. In particular, there are two
main problems:

1) The kernel copies data many times before it ends up in user
space.

2) Using mmap doesn't help because it serializes I/O within a single
process (you can't read simultaneously from two disks).

I hope that the problem description below alerts the Linux community
to the problem and helps to make Linux the best server operating
system available.

Thanks for writing the Linux system,
Gerald

=================== detailed description ====================

Looking into the problem I found that reading large data blocks from
two IDE disks with the read(2) library/system call consumes about 45%
of the available CPU time when reading roughly 30 Mbytes/sec.

I briefly looked into the kernel and think that most of these 45% of
execution time is spend in copying the data just read in from buffer
to buffer until it reaches the user address space (since I don't have
any kernel hacking experience I couldn't figure out how many buffers
exactly were involved). Could anyone shed light into that? It would
be really interesting.

As far as I understand the simplest and fasted implementation involves
the following operations:
1) a read or mmap request translates the data requested into 4KB
blocks
2) the file system then requests these 4KB blocks from the system
block-device cache
3) if the data cannot be found in the block-device cache the
cache requests the data from the device driver
4) finally the file system maps the data from the block-device
cache into user space with copy-on-write [or in the case of
a mmap() simply maps it into user memory]

In other words, for read-only access it should not be necessary to
touch the data read in if the user process is written carefully
(i.e., reads page-aligned data into a page-aligned buffer). In other
words, reading at 30Mbytes/sec should consume almost no kernel CPU
time.

Avoiding copying is very important if one considers that the usable
sequential CPU memory bandwidth usually is not higher than 200-300
MB/sec. An example: Assuming that the sustainable disk bandwidth of
two disks is in the neighbourhood of 30 MB/sec, one memory to memory
copy operation already accounts for 20-30 % of CPU time.

After observing the above problem, I then started experimenting with
mmap/mlock in the hope that mmap would give me better performance than
simple read/write commands. Unfortunately mmap/mlock made things
worse. Overall execution time of the simple example program attached
to this email increased substantially. By looking at the kernel sources I
discovered that accesses to mlock() are serialized within a given process,
so that it is impossible to read from more than one disk at once. For my
server application where multiple threads continously read large
amounts of data from disk, this serialization makes mmap() unusable
for high-performance I/O purposes.

- ----- example C++ program -----
measurements:
read system call: 70 seconds
mmap/mlock: 107 seconds

#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

const int MegaByte = 1024*1024;
const int pages = 90;
const size_t size = 10 * MegaByte;

time_t start;

/* with DISK_READ defined, the program uses read(2) calls;
otherwise, it uses mmap(2). */

#define DISK_READ

#ifdef DISK_READ

void* diskThread( void* name ) {
char* fileName = (char*)name;

fprintf(stderr,"Loading file: %s\n", name );
int f = open(fileName, O_RDONLY );
if (!f) {
fprintf(stderr,"Couldn't open file\n");
exit(1);
}

char* buff = new char[ size ];
for ( int i = 0; i < pages; i++ ) {
if ( read( f, buff, size ) != size ) {
fprintf(stderr, "Failed to read: %s %d\n", fileName, i );
exit(2);
}
}
time_t end;
time( &end );
fprintf(stderr,"Seconds elapsed: %d\n", (int)(end-start) );
}

#else

void* diskThread( void* name ) {
char* fileName = (char*)name;

fprintf(stderr,"Loading file: %s\n", name );
int f = open( fileName, O_RDONLY );
if ( !f ) {
fprintf(stderr,"Couldn't open file\n");
exit(1);
}

for ( int i = 0; i < pages; i++ ) {
void* buf = mmap( 0, size, PROT_READ, MAP_SHARED, f, i*size );
if ( buf == MAP_FAILED ) {
fprintf(stderr, "mmap failed\n" );
exit(2);
}
if ( mlock( buf, size ) ) {
fprintf( stderr, "mlock failed\n");
exit(3);
}

// at this point the data is available and will be processed

if ( munlock( buf, size ) ) {
fprintf( stderr, "munlock failed\n");
exit(4);
}
if ( munmap( buf, size ) ) {
fprintf( stderr, "munlock failed\n");
exit(5);
}
}
time_t end;
time(&end);
fprintf(stderr,"time elapsed: %d\n", (int)(end-start) );
return 0;
}

#endif

int main() {
time( &start );
pthread_t thread1;
pthread_t thread2;
pthread_create( &thread1, 0, diskThread, "hda-file");
pthread_create( &thread2, 0, diskThread, "hdc-file");

// wait forever
pthread_mutex_t mutex;
pthread_mutex_init( &mutex, 0);
pthread_mutex_lock( &mutex );
pthread_mutex_lock( &mutex );
return 0;
}

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/