filesystem performance issues, the write/buffercache dilemma (O_SYNC=horrible performance)

From: Benno Senoner (sbenno@gardena.net)
Date: Wed Apr 19 2000 - 12:40:31 EST


Hi,
after a few discussions with Stephen Tweedie about how to optimize disk writes
when writing many files simultaneously, I decided to benchmark the whole issue.
The initial topic was how to reduce userspace buffering and/or avoid flooding
the Linux disk buffer cache.
The practical application example was an audio harddisk recorder, recording as
many tracks as possible while keeping the userspace buffer needed per
track as small as possible.
(See http://www.linuxdj.com/hdrbench for some graphs of the bursty
behaviour of the Linux disk subsystem.)
The debate was single-threaded writing vs multithreaded writing,
with or without O_SYNC.
Stephen initially suggested using multithreading plus O_SYNC in order to avoid
write-behind and to issue a deep I/O request queue, so that the disk
driver can optimize writes by minimizing disk seeks.

Now my findings: in the average case, a single thread writing 256k+ blocks to
multiple files achieves equal or slightly better performance than a
dedicated thread for each file.
I get about 8-9MB/sec out of my IDE disk.
(I am running on an IBM 18GB EIDE 7200rpm disk + dual Celeron 366 on
kernel 2.2.14-smp (stock RH6.2).)

Stephen's suggestion turns out to be a nightmare: using O_SYNC with 20
simultaneous threads gives me about 0.2 MB/sec total throughput in
multithreaded mode. I would call this _horrible_.
In single-threaded mode I get about 2MB/sec, which is only 25% of the 8MB/sec
I got without O_SYNC. That is practically unusable in both single-threaded and
multithreaded mode.

I have attached a small benchmark program which demonstrates this.

Another topic discussed with Stephen was the use of a future O_DIRECT (like SGI
has), which allows one to do direct I/O from/to userspace, bypassing the
buffer cache.

Today I had the chance to test this on an SGI box (you have to add the
O_DIRECT stuff + buffer alignment if you want to run my test app attached below).
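
The "O_DIRECT stuff + buffer alignment" mentioned above could look roughly like
the sketch below. It is only an illustration, not the code that was run on the
SGI box: it assumes a platform whose open() accepts O_DIRECT and which provides
posix_memalign(). DIO_ALIGN, alloc_aligned_buffer() and open_direct() are
hypothetical names, and 4096 is an assumed alignment (the real requirement
depends on the device and filesystem).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* glibc only exposes O_DIRECT with this defined */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DIO_ALIGN 4096         /* assumed alignment, not taken from the post */

/* Allocate a zero-filled buffer whose address is a multiple of DIO_ALIGN.
   For direct I/O, size should itself be a multiple of DIO_ALIGN. */
char *alloc_aligned_buffer(size_t size)
{
    void *p = NULL;

    if (posix_memalign(&p, DIO_ALIGN, size) != 0)
        return NULL;
    assert(((uintptr_t)p % DIO_ALIGN) == 0);
    memset(p, 0, size);
    return p;
}

/* Open a file for writing, requesting direct I/O where the headers offer it.
   Returns the file descriptor, or -1 on error (see errno). */
int open_direct(const char *filename)
{
    int flags = O_WRONLY | O_TRUNC | O_CREAT;

#ifdef O_DIRECT
    flags |= O_DIRECT;         /* bypass the buffer cache */
#endif
    return open(filename, flags, 0644);
}
```

In hdtest.c this would replace the malloc() of buf and the open() calls; some
filesystems reject O_DIRECT at open time, so the return value still needs to be
checked.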

results:
16MB/sec sustained in both single and multithreaded mode (plain buffered writes)
3MB/sec in single-threaded mode using O_SYNC (O_DSYNC doesn't make any
difference)
6MB/sec in multithreaded mode using O_SYNC (20 threads)

O_DIRECT:
11MB/sec in both single and multithreaded mode.
I even tried increasing the buffer size up to 1MB, but this number did not
increase.

That is roughly a 30% performance hit relative to the 16MB/sec of plain
buffered output, which hurts when you need the raw speed.

Stephen, do you think the same could happen in the Linux case ?

That means that on both the SGI and Linux the best way to achieve the
highest write performance is to use plain buffered output.

It seems quite obvious that the buffered approach can optimize the write
requests by minimizing disk seeks, since it has a deep request queue,
while O_DIRECT and O_SYNC are forced to follow the application's commands.

My main concern was the amount of buffer memory needed by the userspace app:
when doing harddisk recording we write up to 50 tracks, and 2-4MB of buffer per
track (to overcome filesystem latencies) would mean up to 200MB of RAM thrown
away for mere buffering.
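
The buffer arithmetic above can be checked with a few lines of C. This is only
a sketch: the function names are made up for illustration, and the per-track
data rate of 88200 bytes/sec (44.1kHz, 16-bit, mono) is an assumption, since
the post does not state the sample format.

```c
#include <assert.h>

/* Total userspace buffer memory in MB for a given number of tracks. */
long total_buffer_mb(long tracks, long mb_per_track)
{
    return tracks * mb_per_track;
}

/* Seconds of audio that one track's buffer can absorb while write() stalls.
   bytes_per_sec is the per-track data rate (an assumed value, see above). */
double stall_seconds_covered(long buffer_bytes, long bytes_per_sec)
{
    return (double)buffer_bytes / (double)bytes_per_sec;
}

/* Sanity check of the figures quoted in the text: 50 tracks at 2-4MB each. */
void buffer_budget_selfcheck(void)
{
    assert(total_buffer_mb(50, 2) == 100);
    assert(total_buffer_mb(50, 4) == 200);
}
```

Under the assumed data rate, a 2MB buffer (2097152 bytes) covers a stall of a
little under 24 seconds per track, so the buffer size is driven by the worst-case
latency of the disk subsystem rather than by its average throughput.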

Do you think it will be possible to achieve lower filesystem latencies
under Linux (so that a read() or write() call doesn't stall for 8 secs!) with
some modifications to the disk I/O subsystem?
Notice that I did all my testing on an EIDE 18GB UDMA 7200rpm disk.
(Or does perhaps Andrea's elevator patch help in these situations?
I haven't tested it yet.)
With SCSI, operations may be smoother.

Some people say that HD recording apps on Windows can handle many audio tracks
using only 256KB per track without getting buffer overruns.
(On Linux, a 1MB ringbuffer is about the minimum when you want to get
high throughput.)
Maybe I am missing something obvious.

comments ?
Benno.

---------------------------------------------------------------------------------
/* hdtest.c
   small async vs O_SYNC / single vs multithreaded disk writing benchmark
   by Benno Senoner (sbenno@gardena.net)

   the program writes NUMFILES (=20) files simultaneously

   compile with : gcc -O2 -o hdtest hdtest.c -lpthread
   run with ./hdtest TOTAL_OUTPUT_MEGABYTES [sync]

   for example: ./hdtest 100 writes a total amount of 100MB of data

   for O_SYNC output give 'sync' as 2nd argument ( ./hdtest 100 sync )

*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <pthread.h>
#include <time.h>

// I/O block size per write() call: 4 x 256k = 1MB
#define MYSIZE (262144*4)

#define NUMFILES 20

#define NUMLOOPS (TOTAL_WRITE_SIZE/(MYSIZE*NUMFILES))

void *writer_thread(void *data);

pthread_mutex_t my_mutex = PTHREAD_MUTEX_INITIALIZER;
int num_active_threads;
int written_bytes;

time_t time1,time2;
char *buf;

int TOTAL_WRITE_SIZE;

void print_status(void);

int main(int argc, char **argv)
{
  int i,u;
  int res;
  int retcode;
  int counter=0;
  int open_flags;

  pthread_t my_thread[NUMFILES];

  char filename[200];
  int fds[NUMFILES];

  buf=(char *)malloc(MYSIZE);
  for(i=0;i<MYSIZE;i++) {
    buf[i]=0;
  }

  open_flags=O_WRONLY|O_TRUNC|O_CREAT;

  TOTAL_WRITE_SIZE=100*1024*1024;
  if(argc >=2) {
    TOTAL_WRITE_SIZE=1024*1024*atoi(argv[1]);
  }
  TOTAL_WRITE_SIZE=MYSIZE*NUMFILES*NUMLOOPS;
  printf("TOTAL WRITE SIZE=%d\n",TOTAL_WRITE_SIZE);
  
  

  if(argc >=3) {
    if(!strcmp(argv[2],"sync")) {
      open_flags |= O_SYNC;
      printf("opening files in O_SYNC mode\n");
    }
  }
  

  printf("opening files");
  for(i=0;i<NUMFILES;i++) {
    printf(".");
    fflush(stdout);
    sprintf(filename,"outfile%d",i);
    fds[i]=open(filename, open_flags, 0777);
    if(fds[i]<0) {
      fprintf(stderr,"ERROR while opening file '%s'\n",filename);
      perror("ERROR: open");
      exit(1);
    }
  }
  printf("\n");
  sync();
  sleep(2);

  printf("SINGLE THREADED WRITING START\n");
  time(&time1);

  written_bytes=0;
  for(u=0;u<NUMLOOPS;u++) {
    for(i=0;i<NUMFILES;i++) {
      res=write(fds[i],buf,MYSIZE);
      if(res != MYSIZE) {
        printf("write ERROR: res=%d\n",res);
        perror("write");
        exit(1);
      }
      written_bytes += MYSIZE;
      if((counter++ & 7)==0) {
        print_status();
      }
    }
  }

  for(i=0;i<NUMFILES;i++) {
    close(fds[i]);
  }
  sync();
  time(&time2);
  printf("SINGLE THREADED WRITING END\n");
  printf("ELAPSED TIME=%ld\n",(long)(time2-time1));
  printf("EFFECTIVE SPEED=%.3f MByte/sec\n",(double)TOTAL_WRITE_SIZE/(double)(time2-time1)/1000000.0);
  sleep(4);

  printf("opening files");
  for(i=0;i<NUMFILES;i++) {
    sprintf(filename,"outfile%d",i);
    printf(".");
    fflush(stdout);
    fds[i]=open(filename, open_flags, 0777);
    if(fds[i]<0) {
      fprintf(stderr,"ERROR while opening file '%s'\n",filename);
      perror("ERROR: open");
      exit(1);
    }
  }
  sync();
  sleep(2);
  printf("\n");

  printf("MULTI-THREADED WRITING START\n");
  time(&time1);
  written_bytes=0;

  num_active_threads=NUMFILES;
  for(i=0;i<NUMFILES;i++) {
    retcode=pthread_create(&my_thread[i],NULL,writer_thread,&fds[i]);
    if(retcode) {
      fprintf(stderr,"pthread_create of thread %d failed %d\n",i,retcode);
      exit(1);
    }
  }

  printf("ALL THREADS FIRED UP, PARENT WAITING FOR FINISH ....\n");

  while(1) {
    sleep(1);
    print_status();
    if(num_active_threads == 0) break;
  }
  sync();
  time(&time2);
  printf("ELAPSED TIME=%ld\n",(long)(time2-time1));
  printf("EFFECTIVE SPEED=%.3f MByte/sec\n",(double)TOTAL_WRITE_SIZE/(double)(time2-time1)/1000000.0);

  printf("MULTI-THREADED WRITING END\n");

  printf("PARENT FINISH.\n");

  return 0;
}

void *writer_thread(void *data) {

  int myfd;
  int i,res;

  for(i=0;i<MYSIZE;i++) {
    buf[i]=0;
  }

  myfd=*(int *)data;
//printf("writer_thread: myfd=%d\n",myfd);

  for(i=0;i<NUMLOOPS;i++) {
    res=write(myfd,buf,MYSIZE);
    if(res != MYSIZE) {
      printf("write ERROR: res=%d\n",res);
      perror("write");
      exit(1);
    }
    pthread_mutex_lock(&my_mutex);
    written_bytes += MYSIZE;
    pthread_mutex_unlock(&my_mutex);
    //if((i & 15)==0) printf("wrote: fd=%d block %d\n",myfd,i);
  }
  close(myfd);
  pthread_mutex_lock(&my_mutex);
  num_active_threads--;
  pthread_mutex_unlock(&my_mutex);
  return(NULL);
}

void print_status(void) {

  double percent;
  double data_rate;
  time_t time3;

  time(&time3);

  percent=(double)written_bytes/(double)TOTAL_WRITE_SIZE*100.0;
  data_rate=(double)written_bytes/(double)(time3-time1)/1000000.0;

  printf("wrote bytes=%d done %.1f%% RATE=%.3f MByte/sec %c",written_bytes,percent,data_rate,13);
  fflush(stdout);
}
