Idea: remote block device driver

David Monro (davidm@fuzzbox.gh.cs.usyd.edu.au)
Tue, 18 Jun 1996 00:39:20 +1000 (EST)


I had an idea today for improving the performance of diskless boxes.
This idea has been had before at least twice with respect to swap, but I
thought it might be worth generalizing.

The idea is to have a network protocol for serving, not filesystems, but
chunks of disk. Rather than mounting a remote filesystem the client
attaches a remote chunk of disk as a block device, and is responsible
for managing the contents of the device at the block level.

I would envisage the protocol working as follows (a rough sketch of a
possible wire format appears after point 2):
1) Server side: the server will receive requests of the form 'read me n
blocks of data starting at offset m', to which it will reply with a
packet containing that data. It will also accept requests of the form
'write the following n blocks starting at offset m', in which case it will
write the requested data and return an ack _after the blocks are
committed to stable storage_.

2) Client side: On receipt of a request from a higher layer (eg the
filesystem code) for a read, the client simply sends a read request,
waits for the reply and passes the data up. It is free to cache reads to
any extent it wishes. On receipt of a write request, the client copies
the data into a buffer, initiates the remote write request and returns
to the layer above. Only after the remote server has acknowledged the
write is it allowed to throw the buffer away (although it may keep it
for read caching). Of course, actually attempting to do caching at this
level is probably silly - the filesystem layer should do a better job.
This cache is the equivalent of the track buffer on the bottom of your
hard drive, I suppose. I am also not sure whether it is actually necessary
to copy the data, or whether it would be enough to lock the buffers it is
already in - I haven't looked deeply enough into the Linux block device
code yet.
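
To make this concrete, here is a rough sketch of what the on-the-wire
structures might look like (every name, field size and the magic number
here is invented purely for illustration - this is not an existing
protocol):

    #include <stdint.h>

    /* Hypothetical wire format - all field names and sizes are illustrative. */

    #define RBD_MAGIC  0x52424400   /* arbitrary value to catch garbage packets */
    #define RBD_READ   1
    #define RBD_WRITE  2

    struct rbd_request {
            uint32_t magic;     /* RBD_MAGIC */
            uint32_t type;      /* RBD_READ or RBD_WRITE */
            uint32_t seq;       /* sequence number, echoed in the reply */
            uint32_t offset;    /* starting block (the 'm' above) */
            uint32_t count;     /* number of blocks (the 'n' above) */
            /* a WRITE request is followed by count * blocksize bytes of data */
    };

    struct rbd_reply {
            uint32_t magic;
            uint32_t seq;       /* matches the request this reply answers */
            uint32_t error;     /* 0 = ok; for a WRITE this is only sent after
                                   the data has reached stable storage */
            /* a successful READ reply is followed by the requested data */
    };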

I see no problem with multiple outstanding read and write requests with
out-of-order completion - just add a sequence number to the packets.
After all, we already have this for SCSI I think (correct me if I am
wrong please!)
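
For what it's worth, the client-side bookkeeping for requests in flight
might look something like this (again purely a hypothetical sketch, not
real driver code - the names are made up, and in a real driver the list
would need locking):

    #include <stddef.h>
    #include <stdint.h>

    /* One entry per outstanding request.  A write's buffer is held here until
     * the reply with the matching seq arrives, so a server crash never loses
     * data the client already reported as written. */
    struct rbd_pending {
            uint32_t seq;                /* matches rbd_request.seq */
            uint32_t type;               /* read or write */
            void *buffer;                /* data being written, or destination of a read */
            uint32_t count;              /* blocks covered */
            struct rbd_pending *next;    /* simple list of outstanding requests */
    };

    static struct rbd_pending *pending_list;

    /* Called when a reply arrives; replies may come back in any order. */
    static struct rbd_pending *rbd_find_pending(uint32_t seq)
    {
            struct rbd_pending *p;

            for (p = pending_list; p != NULL; p = p->next)
                    if (p->seq == seq)
                            return p;
            return NULL;    /* unknown seq: stale or bogus reply, just drop it */
    }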

My guess is that there is a 90% chance that a suitable protocol already
exists, but I can't actually find anything obvious in my /etc/services
files. Anybody know for sure?

Note that this protocol is completely unsuited for use by multiple
clients unless they are _all_ in readonly mode. It implements what looks
like a standard block device to the layers above, which means that it
doesn't expect to have the contents change under it unexpectedly! It
should also be stateless - since the remote server doesn't acknowledge
write completion until the data is on disk, and the client keeps all
unacknowledged data around, the remote server crashing shouldn't matter.
The client should probably simply block if the server is down for a
while or slow in acknowledging data, to avoid large amounts of data eating
memory at the client end. This should prevent massive lossage in the
case where the server goes away, the client buffers a large amount of
uncommitted data for an extended period, and then crashes itself before
the data ever reaches the disk. After all, if a box crashes while writing
to disk the same sort of lossage is possible. For the paranoid, the client
could implement the sync mount option on a device and not return until
the data is committed to the remote stable storage (a la NFS).
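
One simple way to get that blocking behaviour would be to keep a running
count of unacknowledged bytes and refuse to submit more once some cap is
reached (the cap, the names and the numbers below are all invented just
to illustrate the idea):

    /* Hypothetical flow control: cap the amount of unacknowledged write data
     * so a slow or dead server cannot eat all of the client's memory. */

    #define RBD_MAX_INFLIGHT (256UL * 1024UL)   /* arbitrary 256 KB cap */

    static unsigned long rbd_inflight_bytes;

    /* Nonzero if a request of 'bytes' may be sent now; otherwise the caller
     * must wait (in a real driver: sleep until rbd_write_acked() wakes it). */
    static int rbd_may_submit(unsigned long bytes)
    {
            return rbd_inflight_bytes + bytes <= RBD_MAX_INFLIGHT;
    }

    static void rbd_submit_write(unsigned long bytes)
    {
            rbd_inflight_bytes += bytes;    /* caller already checked rbd_may_submit() */
            /* ... transmit the write request to the server ... */
    }

    static void rbd_write_acked(unsigned long bytes)
    {
            rbd_inflight_bytes -= bytes;    /* ack arrived: data is on stable storage */
    }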

Further thoughts: this could be implemented over UDP or TCP. To maintain
statelessness with TCP, I think all that is required is to attempt to
open a new connection if the current one seems to have died. Some setup
handshaking would be required to make sure the client and the
server both agree that the old session is dead, to avoid bogus things
happening.
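
As a very rough userspace sketch of that reconnect step (the server
address handling and all the names are invented for illustration; a real
implementation would live in the driver and would also need to replay
whatever is still on the pending list):

    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    /* Hypothetical reconnect: open a fresh TCP connection to the server.
     * Because the server keeps no state, it is always safe to resend every
     * request that has not yet been acknowledged on the new connection. */
    static int rbd_reconnect(const struct sockaddr_in *server)
    {
            int fd = socket(AF_INET, SOCK_STREAM, 0);

            if (fd < 0)
                    return -1;
            if (connect(fd, (const struct sockaddr *) server, sizeof(*server)) < 0) {
                    close(fd);
                    return -1;
            }
            /* ... resend everything still outstanding ... */
            return fd;
    }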

Uses: The really obvious application is faster network swap. However, it would
also be possible to use it for a shared, readonly filesystem (eg /usr on
a diskless box) if it is acceptable to have all clients unmount, flush
all cached data, and remount the filesystem on this device if any
changes are made to it. (The obvious way to modify such a filesystem is
with a loopback mount on the server.) It would also be useful for a
non-shared r/w fs (eg the root filesystem for each diskless box),
again with the proviso that you need to be able to flush all information
at the client end if it is to be updated by the server. (There is
nothing to prevent changes being made by the client itself, however.)

So what do people think? If no-one has any reasons why this is _really_
stupid, I might have a little hack at doing a very cut-down version (say
no caching, only one outstanding command, over UDP) and see if it looks
feasible. Look out for /dev/rbd, coming soon to a box near you!

An odd note - it would actually be necessary to fsck any fs on such a
device after a crash. Feels kind of weird.
Even weirder - to improve performance, it should be possible to put two
ethernet cards in both the client and server, run two wires between the
machines, have one of these remote block devices running over each link,
and use md to stripe the result! _Really_ kinky ;-)