Re: Linux 2.6.29
From: Jeff Garzik
Date: Thu Apr 02 2009 - 22:01:43 EST
Linus Torvalds wrote:
> Feel free to give it a try. It _should_ maintain good write speed while
> not disturbing the system much. But I bet if you added the "fadvise()" it
> would disturb things even _less_.
>
> My only point is really that you _can_ do streaming writes well, but at
> the same time I do think the kernel makes it too hard to do it with
> "simple" applications. I'd love to get the same kind of high-speed
> streaming behavior by just doing a simple "dd if=/dev/zero of=bigfile"
>
> And I really think we should be able to.
>
> And no, we clearly are _not_ able to do that now. I just tried with "dd",
> and created a 1.7G file that way, and it was stuttering - even with my
> nice SSD setup. I'm in my MUA writing this email (obviously), and in the
> middle it just totally hung for about half a minute - because it was
> obviously doing some fsync() for temporary saving etc while the "sync" was
> going on.
>
> With the "overwrite.c" thing, I do get short pauses when my MUA does
> something, but they are not the kind of "oops, everything hung for several
> seconds" kind.

Attached is my slightly-modified version of overwrite.c, modded to bound
the file size and to use fadvise().

On a 128GB, 3.0 Gbps no-name SATA SSD, x86-64, ext3, 2.6.29 vanilla kernel:

+ ./overwrite -b 3000 /spare/tmp/test.dat
writing 3000 buffers of size 8m
23.429 GB written in 1019.25 (23 MB/s)

real 17m0.211s
user 0m0.028s
sys 1m5.800s

+ ./overwrite -b 3000 -f /spare/tmp/test.dat
using fadvise()
writing 3000 buffers of size 8m
23.429 GB written in 1060.54 (22 MB/s)

real 17m41.446s
user 0m0.036s
sys 1m9.016s

The most interesting thing I found: the SSD does 80 MB/s for the first
~1 GB or so, then slows down dramatically. After ~2 GB, it is down to 32
MB/s. After ~4 GB, it reaches a steady speed around 23 MB/s.
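
A note on reading the fall-off: the progress line in the attached overwrite.c divides total MB written by total elapsed seconds, so it shows a cumulative average, and a sudden slowdown only pulls that number down gradually. A per-interval rate shows the drop directly. Below is a minimal sketch of how that could be computed; struct rate_window and rate_window_update() are hypothetical names, not part of overwrite.c, and this is only one way to do it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of overwrite.c: remembers the last sample. */
struct rate_window {
	unsigned long long last_bytes;	/* bytes written at the previous sample */
	double last_time;		/* seconds since start at the previous sample */
};

/*
 * Return the throughput in MiB/s since the previous sample, then make
 * the current sample the new reference point.  Returns 0 if no time
 * has elapsed since the previous sample.
 */
static double rate_window_update(struct rate_window *w,
				 unsigned long long bytes, double seconds)
{
	double dt;
	double rate = 0.0;

	assert(w != NULL);

	dt = seconds - w->last_time;
	if (dt > 0.0)
		rate = (double)(bytes - w->last_bytes) / (1024.0 * 1024.0) / dt;

	w->last_bytes = bytes;
	w->last_time = seconds;
	return rate;
}
```

In the write loop this would be called with the running byte count and the same elapsed-seconds value "s" that the progress line already computes, for example once every N buffers rather than on every iteration, so that each window spans enough time to be meaningful.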
--------------------------------------------------
On a 500GB, 3.0 Gbps Seagate SATA drive, x86-64, ext3, 2.6.29 vanilla kernel:

+ ./overwrite -b 3000 /garz/tmp/test.dat
writing 3000 buffers of size 8m
23.429 GB written in 539.06 (44 MB/s)

real 9m0.348s
user 0m0.064s
sys 1m2.704s

+ ./overwrite -b 3000 -f /garz/tmp/test.dat
using fadvise()
writing 3000 buffers of size 8m
23.429 GB written in 535.08 (44 MB/s)

real 8m55.971s
user 0m0.044s
sys 1m6.600s

There is a similar performance fall-off for the Seagate, but much less
pronounced:

	After 1GB: 52 MB/s
	After 2GB: 44 MB/s
	After 3GB: steady state

There appears to be a small increase in system time with "-f" (use
fadvise), but I'm guessing time(1) does not really give a good picture
of overall system time used, when you include background VM activity.
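
On that last point: time(1) only reports CPU time charged to the process itself, so writeback work done by kernel threads on its behalf does not appear in "sys". One rough way to get at the whole picture is to sample the aggregate "cpu" line of /proc/stat before and after a run on an otherwise idle machine. The sketch below is only an illustration of that idea; the struct and function names are made up for the example, and since the counters are system-wide, anything else running during the test is counted too.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: counters from the aggregate "cpu" line of /proc/stat. */
struct cpu_ticks {
	unsigned long long user, nice, system, idle, iowait, irq, softirq;
};

/* Parse an aggregate "cpu ..." line; returns 0 on success, -1 otherwise. */
static int parse_cpu_line(const char *line, struct cpu_ticks *t)
{
	/* Reject per-CPU lines such as "cpu0 ...". */
	if (strncmp(line, "cpu ", 4) != 0)
		return -1;

	if (sscanf(line, "cpu %llu %llu %llu %llu %llu %llu %llu",
		   &t->user, &t->nice, &t->system, &t->idle,
		   &t->iowait, &t->irq, &t->softirq) != 7)
		return -1;
	return 0;
}

/* Read the first line of /proc/stat; returns 0 on success, -1 on failure. */
static int read_cpu_ticks(struct cpu_ticks *t)
{
	char line[256];
	FILE *f = fopen("/proc/stat", "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(line, sizeof(line), f))
		ret = parse_cpu_line(line, t);
	fclose(f);
	return ret;
}

/* Kernel-side ticks (system + irq + softirq) elapsed between two samples. */
static unsigned long long kernel_ticks_between(const struct cpu_ticks *a,
					       const struct cpu_ticks *b)
{
	return (b->system - a->system) + (b->irq - a->irq) +
	       (b->softirq - a->softirq);
}
```

Usage would be: call read_cpu_ticks() before starting the write loop and again after it finishes, then pass both samples to kernel_ticks_between(). The result is in clock ticks summed over all CPUs; dividing by sysconf(_SC_CLK_TCK) converts it to seconds.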
Jeff

#define _GNU_SOURCE

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <fcntl.h>
#include <ctype.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <linux/fs.h>

#define BUFSIZE (8*1024*1024ul)

static unsigned int maxbuf = ~0U;
static int do_fadvise;

static void parse_opt(int argc, char **argv)
{
	int ch;

	while (1) {
		ch = getopt(argc, argv, "fb:");
		if (ch == -1)
			break;

		switch (ch) {
		case 'f':
			do_fadvise = 1;
			fprintf(stderr, "using fadvise()\n");
			break;
		case 'b':
			/* any positive buffer count is valid, including 1 */
			if (atoi(optarg) > 0)
				maxbuf = atoi(optarg);
			else
				fprintf(stderr, "invalid bufcount '%s'\n",
					optarg);
			break;
		default:
			fprintf(stderr, "invalid option 0%o (%c)\n",
				ch,
				isprint(ch) ? ch : '-');
			break;
		}
	}
}

int main(int argc, char **argv)
{
	static char buffer[BUFSIZE];
	struct timeval start, now;
	unsigned int index;
	int fd;

	parse_opt(argc, argv);

	if (optind >= argc) {
		fprintf(stderr, "usage: %s [-f] [-b bufcount] file\n", argv[0]);
		exit(1);
	}

	mlockall(MCL_CURRENT | MCL_FUTURE);

	/* fill the buffer with random data */
	fd = open("/dev/urandom", O_RDONLY);
	if (fd < 0 || read(fd, buffer, BUFSIZE) != BUFSIZE) {
		perror("/dev/urandom");
		exit(1);
	}
	close(fd);

	fd = open(argv[optind], O_RDWR | O_CREAT, 0666);
	if (fd < 0) {
		perror(argv[optind]);
		exit(1);
	}

	if (maxbuf != ~0U)
		fprintf(stderr, "writing %u buffers of size %lum\n",
			maxbuf, BUFSIZE / (1024 * 1024ul));

	gettimeofday(&start, NULL);
	for (index = 0; index < maxbuf; index++) {
		double s;
		unsigned long MBps;
		unsigned long MB;

		if (write(fd, buffer, BUFSIZE) != BUFSIZE)
			break;

		/* start writeback of the buffer just written */
		sync_file_range(fd, index*BUFSIZE, BUFSIZE, SYNC_FILE_RANGE_WRITE);

		/*
		 * There is no previous buffer on the first pass, so skip
		 * both calls then; (index-1) would wrap around otherwise.
		 */
		if (index) {
			/* wait for writeback of the previous buffer to complete */
			sync_file_range(fd, (index-1)*BUFSIZE, BUFSIZE,
					SYNC_FILE_RANGE_WAIT_BEFORE |
					SYNC_FILE_RANGE_WRITE |
					SYNC_FILE_RANGE_WAIT_AFTER);
			/* then drop its now-clean pages from the page cache */
			if (do_fadvise)
				posix_fadvise(fd, (index-1)*BUFSIZE, BUFSIZE,
					      POSIX_FADV_DONTNEED);
		}

		gettimeofday(&now, NULL);
		s = (now.tv_sec - start.tv_sec) + ((double) now.tv_usec - start.tv_usec)/ 1000000;

		/* cumulative average since start, not an instantaneous rate */
		MB = index * (BUFSIZE >> 20);
		MBps = MB;
		if (s > 1)
			MBps = MBps / s;
		printf("%8lu.%03lu GB written in %5.2f (%lu MB/s) \r",
			MB >> 10, (MB & 1023) * 1000 >> 10, s, MBps);
		fflush(stdout);
	}
	close(fd);

	printf("\n");
	return 0;
}