A peek at the future of storage

From: Daniel Phillips
Date: Wed Dec 12 2007 - 11:46:34 EST


The imminent demise of rotating media storage has been predicted for
many years now, and here is a device that aims to help turn that
fantasy into fact:

http://www.violin-memory.com/products/violin1010.html

Well. At least if you have the money, and are not bothered by the fact
that that big box of ram in the picture (2U) only stores the same
amount of data as a typical $100 sata disk. But if you are the proud
owner of a disk-bound application that cannot be scaled any other way,
a device like this is going to be interesting to you right now. Maybe
in ten years' time everybody will have something like this in their
laptop.

The characteristics of a solid state disk such as this one are very
interesting, and needless to say, somewhat different from rotating
media.

Here is the big one:

Seek time = 0

And this one is not so bad:

Transfer latency = a handful of microseconds

And to round it out:

Write bandwidth = about 1,000 MB/sec
Read bandwidth = about 1,500 MB/sec

Violin gets these numbers because their device is essentially a huge
ramdisk connected to a host server by a PCI-e 8x external bus. Well,
this is not just a box of DRAM, it is a box of raided (raid 3!)
hot-swappable DRAM, and it runs Linux as a high level supervisor. So
to put it mildly, this is an interesting piece of equipment.
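
To put those figures in perspective, here is a back-of-envelope sketch of
the time needed to service one small random read. The SSD inputs come from
the numbers above, with 5 microseconds standing in for "a handful". The
rotating-disk inputs (8 ms for seek plus rotational delay, 70 MB/sec
transfer) and the 4 KB request size are assumed values for illustration,
not measurements.

```python
def service_time_us(size_bytes, seek_us, latency_us, bandwidth_mb_s):
    """Rough time in microseconds to service one random I/O request."""
    # bytes / (MB/sec * 1e6 bytes/MB) gives seconds; * 1e6 gives microseconds
    transfer_us = size_bytes / (bandwidth_mb_s * 1e6) * 1e6
    return seek_us + latency_us + transfer_us

BLOCK = 4096  # bytes per request (assumed)

# Violin figures from the text; 5 us stands in for "a handful of microseconds".
ssd_us = service_time_us(BLOCK, seek_us=0, latency_us=5, bandwidth_mb_s=1500)

# Assumed figures for a commodity sata disk, for illustration only.
disk_us = service_time_us(BLOCK, seek_us=8000, latency_us=0, bandwidth_mb_s=70)

print(f"SSD:   {ssd_us:8.1f} us per 4 KB random read")
print(f"Disk:  {disk_us:8.1f} us per 4 KB random read")
print(f"Ratio: about {disk_us / ssd_us:.0f}x")
```

With seek time gone, the transfer itself becomes a significant share of
the cost on the SSD, whereas on the disk it is lost in the noise.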

We had the good fortune to be able to run some preliminary benchmarks on
one of the first of these machines off the production line. As the
Zumastor team (http://zumastor.org), naturally we were interested in
the performance effect on our NAS application, which consists of knfsd
running on top of a ddsnap virtual block device.

Using a run-of-the-mill rackmount server connected to the Violin box, we
found Ext3 capable of roughly 500 MB/sec write speed and 650 MB/sec
read, for large linear transfers. So some bandwidth got lost
somewhere, but unfortunately we did not have time to go hunting and see
where it went. Still, hundreds of megabytes of read and write
bandwidth are hardly something to sneeze at. We went on to some higher
level tests.

The thing that interested us most was, what would be the bottom line
effect on NFS serving performance, with and without volume
snapshotting. We planned to compare this to the same machine serving
data off a single sata disk. It would have been nice to compare to,
say, a 5 disk array as well, but unfortunately such a configuration
could not be set up in the time available. Later.

Ddsnap is a seek intensive storage application because it has to
maintain a fairly complex data structure on disk, which may have to be
updated on every write. Those updates have to be durable, so add in
the cost of a journal or moral equivalent. This creates a scary amount
of seek activity under heavy write loads. So what happens on a solid
state disk? Obviously, things improve a lot.

Now a word about how one measures NFS performance. Total throughput
with lots of simulated clients? One would think. But actually, the
fashion is to measure transaction latency for some given number of
transactions per second. Which is logical when you think about it:
total throughput may well increase when you throw more traffic at a
server, but what use is that if latency goes through the roof? To know
more about the esoterica of NFS benchmarking, see here:

http://www.spec.org/sfs97r1/docs/sfs-3.0v11.html
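
A toy queueing model shows why latency at a given request rate is the
more telling number. This sketch assumes an M/M/1 queue (random arrivals,
a single server), which is a textbook simplification and not how fstress
or SPEC SFS model a server. The capacity figure is invented.

```python
def mean_latency_ms(offered_ops_per_s, capacity_ops_per_s):
    """Mean response time of an M/M/1 queue, in milliseconds.

    Returns infinity once the offered load reaches the server's capacity.
    """
    if offered_ops_per_s >= capacity_ops_per_s:
        return float("inf")
    # M/M/1 mean response time is 1 / (service rate - arrival rate) seconds
    return 1000.0 / (capacity_ops_per_s - offered_ops_per_s)

CAPACITY = 2500  # ops/sec the hypothetical server can sustain (assumed)

for offered in (500, 1500, 2000, 2400, 2490):
    print(f"{offered:5d} ops/sec -> {mean_latency_ms(offered, CAPACITY):7.1f} ms")
```

Throughput keeps rising all the way to capacity, but latency climbs
slowly at first and then goes vertical, which is the behaviour a
latency-versus-load graph is designed to expose.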

The fstress test we use is an open source effort kindly contributed by
Jeff Chase and Darrell Anderson of Duke University. Thank you very
much, Jeff and Darrell.

http://www.cs.duke.edu/ari/fstress/

I cannot attest to any particular relationship between SPEC SFS and
fstress results. So please do not compare our fstress numbers to
commercially published SPEC SFS results. Though they attempt to
measure much the same thing, the algorithms are not precisely the same,
and that might cause results of fstress to be quite different from SPEC
SFS on the same hardware. So, did I say, please do not compare these
results to commercially published SPEC SFS results? Thank you :-)

We ran three fstress tests:

1. NFS served directly from an Ext3 volume
2. NFS served from a ddsnap virtual volume with no snapshots
3. NFS served from a ddsnap virtual volume holding one snapshot

Server Hardware:

HP DL-385 with 8GB of RAM
2 x dual-core Opteron 2220
Single SAS disk
Violin 1010 connected via PCI-e 8x
Chelsio 10GigE directly connected to client

Client hardware:

Dell Precision 380 with 10GigE

Test results using the Violin SSD device:

http://zumastor.org/graphs/fstress.violin.jpg

We see that at 20,000 NFS operations per second, latency is only 6
milliseconds per NFS operation on raw Ext3 and 9 milliseconds on a
snapshotted virtual volume. Unfortunately, we were unable to test
higher transaction rates this time because of instability with the
10 GigE network connection that could not be tracked down in the time
we had. Until we have a chance to perform further tests, we can only
guess how high the performance scales before latency goes vertical.

For comparison, we ran the same tests using a single sata disk in place
of the Violin SSD:

http://zumastor.org/graphs/fstress.sata.jpg

Here, network interface instability cut short our fstress runs, so we
only got a few data points for snapshotted volumes. However, as far
as it goes, the relationship between raw, virtual and snapshotted
latency looks similar to the results on the Violin SSD. The raw
results were obtained up to 2,000 operations per second, and there we
already see 160 ms latency. That is more than 20 times the latency at
one tenth the operations per second. So roughly speaking, the Violin
SSD speeds the whole system up by a factor of 200 or more versus the
sata disk. Pretty cool, hmm?
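
The arithmetic behind that estimate can be spelled out using the raw Ext3
figures quoted above. Multiplying the latency ratio by the rate ratio is
only a rough figure of merit, not a rigorous metric, and the exact ratios
come out somewhat higher than the rounded figures in the text.

```python
# Raw Ext3 figures quoted in the text.
ssd_ops_per_s, ssd_latency_ms = 20_000, 6      # Violin SSD
sata_ops_per_s, sata_latency_ms = 2_000, 160   # single sata disk

latency_ratio = sata_latency_ms / ssd_latency_ms   # how much slower each op is
rate_ratio = ssd_ops_per_s / sata_ops_per_s        # how many more ops are served
combined = latency_ratio * rate_ratio              # rough figure of merit

print(f"latency ratio: {latency_ratio:.1f}x")
print(f"rate ratio:    {rate_ratio:.0f}x")
print(f"combined:      {combined:.0f}x")
```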

We learned a lot from these tests. The first and most important news
from our point of view is that snapshotting via ddsnap does not have a
particularly horrible effect on NFS serving performance, either at the
low end or the high end of the hardware performance spectrum. This was
a big relief for us, because we had always worried that the copy-on-write
strategy adopted for ddsnap could exhibit rather large write
performance artifacts, over 10 times worse performance in some cases
than a raw disk. In practice, the awful write performance of NFS in
general hides a lot of that write slowness, disk cache hides some more,
and the relatively low balance of writes in the fstress algorithm hides
yet more. The gap between snapshotted and unsnapshotted performance
seems to get wider on the SSD if anything, a counterintuitive result
that is possibly explained by data bandwidth considerations as
discussed below.
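
For readers unfamiliar with copy-on-write snapshots, here is a toy model
of where the extra write cost comes from. It is a sketch only: the class,
its names and the assumed cost of three extra I/Os per first write are
hypothetical, and it ignores seeks, journaling and everything else about
the real ddsnap on-disk format.

```python
class ToyCowVolume:
    """Toy copy-on-write volume that counts chunk-sized I/O operations."""

    def __init__(self, nchunks):
        self.origin = [b""] * nchunks   # live volume contents
        self.snapshot_store = {}        # chunk index -> preserved old data
        self.has_snapshot = False
        self.io_count = 0               # chunk reads + writes issued

    def take_snapshot(self):
        self.snapshot_store.clear()
        self.has_snapshot = True

    def write(self, chunk, data):
        if self.has_snapshot and chunk not in self.snapshot_store:
            old = self.origin[chunk]            # read the old data
            self.snapshot_store[chunk] = old    # copy it to the snapshot store
            self.io_count += 3                  # read + copy write + metadata write (assumed)
        self.origin[chunk] = data               # the write the caller asked for
        self.io_count += 1

    def read_snapshot(self, chunk):
        # Unmodified chunks are still shared with the origin volume.
        return self.snapshot_store.get(chunk, self.origin[chunk])


vol = ToyCowVolume(8)
vol.write(0, b"old")      # no snapshot yet: 1 I/O
vol.take_snapshot()
vol.write(0, b"new")      # first write after the snapshot: 4 I/Os
vol.write(0, b"newer")    # chunk already preserved: 1 I/O
print(vol.io_count, vol.read_snapshot(0))
```

Only the first write to each chunk after a snapshot pays the copy
penalty, which is one reason a mixed workload sees far less slowdown than
the worst case.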

The next thing we learned is that a solid state disk makes NFS serving
go an awful lot faster, other things being equal. The news here is
that all of the following turned out to scale very well: Ext3, the
block layer, knfsd, networking, device mapper and ddsnap. Personally,
I was quite surprised at how well knfsd scales. It seems to me that
the popular wisdom has always been that our knfsd is something of a
turtle, but when I went to read the code to find out why, I just could
not find any glaring reason why that should be so. And lo, we now see
that it just ain't so. Big kudos to all those who have worked on the
code over the years to turn it into a great performer. Very thrifty
with CPU too.

A third thing we learned is that we can run at dizzying speeds under
stupefying load for as long as we had time to test, without
deadlocking. This was only possible due to our fixing instabilities in
the core kernel, which ties into another thread we have seen here on lkml
recently.

Incidentally, we ran our tests with 128 knfsd threads. The default of 8
threads produces miserable performance on the SSD, which gave us a good
scare on our initial test run. It would be very nice to implement an
algorithm to scale the knfsd thread pool automatically, in order to
eliminate this class of thing that can go wrong. If somebody became
inspired to take on that little project that would be great, otherwise
it is in our pipeline for, hmm, Christmas delivery. (Exactly which
Christmas is left unspecified.)
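
One possible shape for such an algorithm, as a sketch only. The policy,
the thresholds, the limits and the busy_fraction input (the share of
threads found busy during the last sampling interval) are all
hypothetical. Nothing here reflects existing knfsd code, and a real
implementation would have to decide where to sample load and how to
apply the new count.

```python
def next_thread_count(current, busy_fraction, lo=8, hi=1024):
    """Hypothetical policy: grow fast when saturated, shrink slowly when idle."""
    if busy_fraction > 0.9:
        target = current * 2                     # nearly all threads busy: double
    elif busy_fraction < 0.3:
        target = current - max(1, current // 8)  # mostly idle: back off gently
    else:
        target = current                         # in the comfortable band: hold
    return max(lo, min(hi, target))              # clamp to the allowed range


threads = 8  # the default mentioned above
for _ in range(4):
    threads = next_thread_count(threads, busy_fraction=0.95)
print(threads)  # 8 -> 16 -> 32 -> 64 -> 128
```

Growing by doubling and shrinking by small steps is a deliberate
asymmetry: too few threads hurts immediately, while a few idle threads
cost little.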

Now, it is really nice that just throwing an SSD at our software makes
it run really fast, but naturally we want our software to go even
faster. Up till now, filesystem (and fancy virtual block device)
optimization has been mainly about reducing seeks, because seeking is
the big performance killer. This is completely irrelevant with an SSD,
because there is no seek time. That brings the second biggest eater of
performance to the top of the list: total data transferred per
operation. So to get even more performance out of the SSD, we must cut
down the total data transferred. In the case of ddsnap, that is not
very hard, mainly because my initial design was pretty lazy about
writing out lots of metadata blocks on each snapshot update. I knew
those writes would mostly go to nearby places, thus not adding a lot of
extra seeks. Now, with an eye to making SSD work better, we intend to
amortize some of that traffic using a logical journaling strategy.
This and other ideas waiting in the wings should cut metadata traffic
by half or more.
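
A toy model of why logging small logical records can cut metadata
traffic. This is a sketch under invented assumptions (4 KB metadata
blocks, 64-byte log records, a flush every ten updates, every update
touching the same three blocks) and is not the ddsnap design.

```python
def direct_bytes(updates, block_size=4096):
    """Write every dirtied metadata block in full on every update."""
    return sum(len(blocks) * block_size for blocks in updates)


def journaled_bytes(updates, flush_every, block_size=4096, record_size=64):
    """Log a small record per change; write each dirty block once per flush."""
    total = 0
    dirty = set()
    for i, blocks in enumerate(updates, start=1):
        total += record_size * len(blocks)   # one logical record per changed block
        dirty.update(blocks)
        if i % flush_every == 0:
            total += len(dirty) * block_size # periodic flush of distinct dirty blocks
            dirty.clear()
    total += len(dirty) * block_size         # final flush of anything left over
    return total


# 100 snapshot updates, each touching the same three metadata blocks (assumed).
updates = [{1, 2, 3}] * 100
direct = direct_bytes(updates)
journaled = journaled_bytes(updates, flush_every=10)
print(direct, journaled, f"{direct / journaled:.1f}x less traffic")
```

The saving depends entirely on how often successive updates land on the
same blocks, so the ratio printed here says nothing about what a real
workload would achieve.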

So what does the arrival of SSD mean for filesystem development in
general? Fortunately, reducing total data transfers is also a good
thing for rotating media, so long as the reduction is not obtained by
adding more seeks. The flip side is, reducing total data transfers
suddenly becomes a lot more important. Hence, the Zumastor team will
concentrate on optimizing ddsnap for both SSD and rotating media. For
pragmatic reasons, most optimization work will continue to be directed
at the latter, but experience with this hardware has certainly changed
our thinking about where we are headed in the long run.

Warm thanks to all who read this far and thanks to Violin Memory for
providing us access to this very interesting hardware.

Daniel