Re: Linux 2.6.29

From: Jeff Garzik
Date: Thu Apr 02 2009 - 22:01:43 EST


Linus Torvalds wrote:
Feel free to give it a try. It _should_ maintain good write speed while not disturbing the system much. But I bet if you added the "fadvise()" it would disturb things even _less_.

My only point is really that you _can_ do streaming writes well, but at the same time I do think the kernel makes it too hard to do it with "simple" applications. I'd love to get the same kind of high-speed streaming behavior by just doing a simple "dd if=/dev/zero of=bigfile"

And I really think we should be able to.

And no, we clearly are _not_ able to do that now. I just tried with "dd", and created a 1.7G file that way, and it was stuttering - even with my nice SSD setup. I'm in my MUA writing this email (obviously), and in the middle it just totally hung for about half a minute - because it was obviously doing some fsync() for temporary saving etc while the "sync" was going on.

With the "overwrite.c" thing, I do get short pauses when my MUA does something, but they are not the kind of "oops, everything hung for several seconds" kind.

Attached is my slightly-modified version of overwrite.c, modded to bound the file size and to use fadvise().

On a 128GB, 3.0 Gbps no-name SATA SSD, x86-64, ext3, 2.6.29 vanilla kernel:

+ ./overwrite -b 3000 /spare/tmp/test.dat
writing 3000 buffers of size 8m
23.429 GB written in 1019.25 (23 MB/s)

real 17m0.211s
user 0m0.028s
sys 1m5.800s


+ ./overwrite -b 3000 -f /spare/tmp/test.dat
using fadvise()
writing 3000 buffers of size 8m
23.429 GB written in 1060.54 (22 MB/s)

real 17m41.446s
user 0m0.036s
sys 1m9.016s
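
As a sanity check, the final status lines of both runs can be reproduced from the buffer count and elapsed time using the same integer arithmetic as the progress printf in overwrite.c. This is only a sketch: `struct progress` and `progress_at()` are hypothetical names, not part of the attached program. Note that the last loop iteration has index 2999, so the total shown is 2999 x 8 MB rather than 3000 x 8 MB.

```c
#include <stdio.h>

#define BUFSIZE (8*1024*1024ul)

/* Hypothetical container for the three numbers the status line prints. */
struct progress {
	unsigned long gb;	/* whole GB written */
	unsigned long gb_frac;	/* thousandths of a GB */
	unsigned long mbps;	/* average MB/s, truncated */
};

/* Same arithmetic as the progress printf in overwrite.c. */
static struct progress progress_at(unsigned int index, double s)
{
	struct progress p;
	unsigned long MB = index * (BUFSIZE >> 20);
	unsigned long MBps = MB;

	if (s > 1)
		MBps = MBps / s;

	p.gb = MB >> 10;
	p.gb_frac = (MB & 1023) * 1000 >> 10;
	p.mbps = MBps;
	return p;
}
```

With index 2999 and 1019.25 seconds this gives 23.429 GB at 23 MB/s; with 1060.54 seconds it gives 22 MB/s, matching the two runs.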


The most interesting thing I found: the SSD does 80 MB/s for the first ~1 GB or so, then slows down dramatically. After ~2GB, it is down to 32 MB/s. After ~4GB, it reaches a steady speed around 23 MB/s.
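
The figure overwrite.c prints is a cumulative average, so a fall-off like this is only visible by watching the number drift during the run. A per-interval rate makes it explicit. The following is a minimal sketch, assuming it is fed the elapsed seconds and MB-written values that the loop already computes; `struct rate_window` and `window_rate()` are hypothetical names, not part of the attached program.

```c
#include <stdio.h>

/*
 * Hypothetical helper: remembers the previous sample so that each
 * call reports throughput since the last call, not since the start.
 */
struct rate_window {
	double last_s;		/* elapsed seconds at the previous sample */
	unsigned long last_mb;	/* MB written at the previous sample */
};

/* Returns MB/s over the interval since the previous call. */
static double window_rate(struct rate_window *w, double s, unsigned long mb)
{
	double dt = s - w->last_s;
	double rate = dt > 0 ? (mb - w->last_mb) / dt : 0.0;

	w->last_s = s;
	w->last_mb = mb;
	return rate;
}
```

Starting from a zeroed `struct rate_window`, a sample at 12.8 seconds and 1024 MB followed by one at 44.8 seconds and 2048 MB (illustrative values, not measurements) would report roughly 80 MB/s and then 32 MB/s.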


--------------------------------------------------

On a 500GB, 3.0 Gbps Seagate SATA drive, x86-64, ext3, 2.6.29 vanilla kernel:

+ ./overwrite -b 3000 /garz/tmp/test.dat
writing 3000 buffers of size 8m
23.429 GB written in 539.06 (44 MB/s)

real 9m0.348s
user 0m0.064s
sys 1m2.704s


+ ./overwrite -b 3000 -f /garz/tmp/test.dat
using fadvise()
writing 3000 buffers of size 8m
23.429 GB written in 535.08 (44 MB/s)

real 8m55.971s
user 0m0.044s
sys 1m6.600s


There is a similar performance fall-off for the Seagate, but much less pronounced:
After 1GB: 52 MB/s
After 2GB: 44 MB/s
After 3GB: steady state




There appears to be a small increase in system time with "-f" (use fadvise), but I'm guessing time(1) does not really give a good picture of overall system time used, when you include background VM activity.
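
One way to get a system-wide picture might be to snapshot the aggregate "cpu" line of /proc/stat before and after the run and compare the kernel-side counters. The following is a rough sketch under that assumption; `struct cpu_times`, `parse_cpu_line()`, `read_cpu_times()` and `kernel_ticks()` are hypothetical names, and the counters cover everything running on the machine, not only the writeback caused by the test.

```c
#include <stdio.h>

/*
 * Hypothetical struct: the first seven fields of the aggregate "cpu"
 * line in /proc/stat, in USER_HZ ticks.
 */
struct cpu_times {
	unsigned long long user, nice, system, idle, iowait, irq, softirq;
};

/* Parse one "cpu ..." line; returns 0 on success, -1 on failure. */
static int parse_cpu_line(const char *line, struct cpu_times *t)
{
	int n = sscanf(line, "cpu %llu %llu %llu %llu %llu %llu %llu",
		       &t->user, &t->nice, &t->system, &t->idle,
		       &t->iowait, &t->irq, &t->softirq);

	return n == 7 ? 0 : -1;
}

/* Snapshot the system-wide counters; call before and after the run. */
static int read_cpu_times(struct cpu_times *t)
{
	char line[256];
	FILE *f = fopen("/proc/stat", "r");
	int ret = -1;

	if (!f)
		return -1;
	/* The aggregate line is the first line of the file. */
	if (fgets(line, sizeof(line), f))
		ret = parse_cpu_line(line, t);
	fclose(f);
	return ret;
}

/* Kernel-side ticks (system + irq + softirq) between two snapshots. */
static unsigned long long kernel_ticks(const struct cpu_times *a,
				       const struct cpu_times *b)
{
	return (b->system - a->system) + (b->irq - a->irq) +
	       (b->softirq - a->softirq);
}
```

Dividing the result by sysconf(_SC_CLK_TCK) converts ticks to seconds, which can then be compared against the "sys" figure from time(1).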

Jeff




#define _GNU_SOURCE
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <ctype.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <linux/fs.h>

#define BUFSIZE (8*1024*1024ul)

static unsigned int maxbuf = ~0U;
static int do_fadvise;

static void parse_opt(int argc, char **argv)
{
	int ch;

	while (1) {
		ch = getopt(argc, argv, "fb:");
		if (ch == -1)
			break;

		switch (ch) {
		case 'f':
			do_fadvise = 1;
			fprintf(stderr, "using fadvise()\n");
			break;
		case 'b':
			/* Any positive buffer count is valid, including 1. */
			if (atoi(optarg) > 0)
				maxbuf = atoi(optarg);
			else
				fprintf(stderr, "invalid bufcount '%s'\n",
					optarg);
			break;
		default:
			fprintf(stderr, "invalid option 0%o (%c)\n",
				ch,
				isprint(ch) ? ch : '-');
			break;
		}
	}
}

int main(int argc, char **argv)
{
	static char buffer[BUFSIZE];
	struct timeval start, now;
	unsigned int index;
	int fd;

	parse_opt(argc, argv);
	if (optind >= argc) {
		fprintf(stderr, "usage: %s [-f] [-b bufcount] file\n", argv[0]);
		exit(1);
	}

	mlockall(MCL_CURRENT | MCL_FUTURE);

	/* Fill the buffer with random data so the writes are not all zeroes. */
	fd = open("/dev/urandom", O_RDONLY);
	if (fd < 0 || read(fd, buffer, BUFSIZE) != BUFSIZE) {
		perror("/dev/urandom");
		exit(1);
	}
	close(fd);

	fd = open(argv[optind], O_RDWR | O_CREAT, 0666);
	if (fd < 0) {
		perror(argv[optind]);
		exit(1);
	}

	if (maxbuf != ~0U)
		fprintf(stderr, "writing %u buffers of size %lum\n",
			maxbuf, BUFSIZE / (1024 * 1024ul));

	gettimeofday(&start, NULL);
	for (index = 0; index < maxbuf; index++) {
		double s;
		unsigned long MBps;
		unsigned long MB;

		if (write(fd, buffer, BUFSIZE) != BUFSIZE)
			break;

		/* Kick off writeback of the chunk just written. */
		sync_file_range(fd, index*BUFSIZE, BUFSIZE, SYNC_FILE_RANGE_WRITE);

		/*
		 * Wait for the previous chunk to reach the disk, then
		 * optionally drop it from the page cache.  There is no
		 * previous chunk on the first pass, so skip both calls.
		 */
		if (index) {
			sync_file_range(fd, (index-1)*BUFSIZE, BUFSIZE,
					SYNC_FILE_RANGE_WAIT_BEFORE |
					SYNC_FILE_RANGE_WRITE |
					SYNC_FILE_RANGE_WAIT_AFTER);
			if (do_fadvise)
				posix_fadvise(fd, (index-1)*BUFSIZE, BUFSIZE,
					      POSIX_FADV_DONTNEED);
		}

		gettimeofday(&now, NULL);
		s = (now.tv_sec - start.tv_sec) +
		    ((double) now.tv_usec - start.tv_usec) / 1000000;

		/* Cumulative average since the start of the run. */
		MB = index * (BUFSIZE >> 20);
		MBps = MB;
		if (s > 1)
			MBps = MBps / s;
		printf("%8lu.%03lu GB written in %5.2f (%lu MB/s) \r",
		       MB >> 10, (MB & 1023) * 1000 >> 10, s, MBps);
		fflush(stdout);
	}
	close(fd);
	printf("\n");
	return 0;
}