[RFC] An alternative interface to device mapper

From: Daniel Phillips
Date: Wed Mar 05 2008 - 04:30:31 EST


Q: If you have an axe with a rusty head and a rotten handle, how do you
repair it?

A: It is a two step process. First you replace the handle, then you
replace the head.

At one time, volume management on Linux was new and shiny. Today it has
fallen well behind Sun, FreeBSD, NetApp and even Microsoft. In order
to avoid losing even more of the storage "market" than we have already
lost, we must make a concerted effort to catch up with the state of the
art, or ideally take the lead as has proved possible with so many other
aspects of Linux.

This RFC is about replacing the handle of our not-so-shiny LVM axe.
Design goals for this alternative "ddsetup" interface are:

* Convenient to embed in a C program
* Make it simple enough that a library is unnecessary
* Support for creating detailed, accurate error messages
* Error messages delivered to caller rather than logged
* Naturally extensible as new requirements emerge
* 32 bit ABI works on 64 bit kernel without translation
* Avoid bad API practices identified by [ARND 07]
* Do not break the existing ioctl interface

The patch below includes a new kernel interface generator called ddlink.
The ddsetup device mapper interface is an instance of a ddlink
interface, instantiated by supplying domain-specific methods for read,
write, ioctl and poll.

In more detail: ddlink is a generic pipe-like interface for controlling
device drivers. It was inspired by Trond's venerable and successful
rpc-pipefs, which he invented to control various aspects of NFS server
and client operation. ddlink takes the form of a virtual filesystem
with no namespace. It provides application programs with fd objects
that can be read, written, ioctled and polled, suitable for efficient
binary communication with kernel components. Read, write and poll
operations act similarly to a pipe. Unlike a pipe, there is no write
buffering. Each write to a ddlink directly triggers some kernel
handler. Reads are buffered via an output queue of ddlink "items", each
of which is an unrestricted blob. Ioctls on ddlinks are unrestricted
and the ioctl command space is unpolluted.

There are no partial reads on ddlinks. A read call either provides
enough space to hold the next outbound kernel item or EIO is triggered,
meaning "make your buffer bigger and try again". In practice this
arrangement takes the onus off the userspace program to buffer partial
reads in order to reassemble input that would otherwise be brutally
dismembered. As a bonus, the kernel code for ddlink is considerably
simplified versus Trond's rpc-pipefs precursor.

Unlike a pipe, there is no waiting for input on a ddlink: if there is
nothing to read then the read returns immediately with zero length. If
some other behavior is desired it can be obtained using poll.

There is a simple framework to provide for generalized allocation and
destruction of dditems, the internal transaction unit for a ddlink.
Finally, there is a small library of helper funcations that are useful
for creating domain-specific ddlink interfaces. The core code for
ddlink is about 150 lines of lindented C, plus 100 lines of library
functions and a header file that has already ballooned to the
intimidating size of 50 lines. In other words, ddlink is about as
light and tight as an interface gets. It is also highly efficient,
flexible and extensible, and requires very little boilerplate code.

I do not know whether the lack of a ddlink namespace is a bug or a
feature. In current usage, a ddlink fd is delivered to an application
program by some out of band means. For example, to get a ddsetup ddlink
to control device mapper, you ioctl /dev/mapper/control. Makes sense?
One can imagine many other methods of obtaining a ddlink, for example,
by reading a fd number in ascii from a sysfs file (gah). In practice,
lacking a namespace feels more like a feature than a bug.

A quick tour of the ddlink library:

int ddlink(struct file_operations *fops,
void *(*create)(struct ddinode *dd, void *info), void *info)

Create a new ddlink fd that will use the supplied file operations.
An optional create method may be supplied and arbitrary state
information supplied via "info". (I am not sure why I set up the
create as a callback, but I do recall that when I tried to do it
otherwise, conciseness of usage was degraded. I might revisit this
detail.)

void ddlink_queue(struct ddinode *dd, struct dditem *item)

Queue an output data item onto the tail of the output queue.

void ddlink_push(struct ddinode *dd, struct dditem *item)

Push an output data item onto the head of the output queue. Useful
for error messages or transaction-type calls that can return data
without disturbing any preexisting queue contents

struct dditem *ddlink_pop(struct ddinode *dd)

Remove a data item from the front of the queue

struct dditem *dditem_new(struct ddinode *dd, size_t size)

Create a new dditem with the indicated data size, whose char data
area is available as item->data

void ddlink_clear(struct ddinode *dd)

Destroy all queued data items

int ddlink_ready(struct ddinode *dd)

Returns true if the ddlink queue is nonempty

unsigned ddlink_poll(struct file *file, poll_table *table)

A simple poll method that only supports polling for input,
nearly always just what is needed

int ddlink_error(struct ddinode *dd, int err, const char *fmt, ...)

Format and push an error item onto the queue so that the error
text will be retrieved by the next read from the ddlink

So that is ddlink, short and sweet. Only a couple of trivial glue
functions were omitted. The remainder of this note is about ddsetup,
which is the single extant example of a ddlink interface.

Device Mapper is actually a lot more capable than most people know.
Each device mapper device consists of two layers: the virtual device
exposed to applications, and an underlying table of "target devices",
each of which is effectively a virtual block device in itself.

The "map" part of device mapper is about translating each bio directed
at the virtual device into one or more bio transfers to some contiguous
subset of the underlying device table. In addition, device mapper
implements stacking, whereby additional layers of virtual devices can
be inserted into an existing top level virtual device while that device
remains open and in use. Details of how this is accomplished are
surprisingly simple, but outside the scope of today's note. The
important thing here is to get a sense of just how rich the device
mapper interface needs to be in order to expose the full range of
device mapper capabilities to application programs.

Device mapper is currently exposed to userland via a stupifyingly
complex interface, in which 16 different device mapper subfunctions are
multiplexed via ioctls through one grand unified parameter structure.
This interface has proved so unwieldy that exactly one userspace program
uses it, namely libdevmapper. Unfortunately, libdevmapper brings its
own oddly structured interface to the party, in which a series of ioctl
calls is recast as a "task", a thoroughly unsuccessful abstraction. As
far as I know, the libdevmapper interface is only used by three
userspace programs: dmsetup, lvm2 and cryptsetup. There may be others,
but the point is, if this interface were well suited to its task then
there would be lots of programs using it by now. Instead nearly all
device mapper usage continues to be scripted via dmsetup or (less
commonly) lvm2 commands, or carried out manually using the lvm2
interactive interface.

This many years into the effort we ought to be slicing and dicing
volumes as second nature, changing configuration on the fly,
transparently expanding, shrinking and migrating filesystems, and many
other things that ZFS and GEOM are already doing and we are not. It is
not so much that device mapper is incapable of such fancy tricks, but
that we have taken a very powerful kernel subsystem and hobbled it with
a nearly unusable application interface. Think about a jet turbine
racecar with a two inch air intake.

So here I have attempted to create a granular interface to expose the
same functionality as the existing device mapper ioctl interface does,
but in a transparent and easy enough way that no library is required,
and concise enough that when you need to, you can realistically embed
your lvm operations inline in a C program.

The ddsetup design does not abandon ioctls entirely. Though ddlink is
perfectly capable of implementing the entire interface via rpc-like
write and read calls with function codes included in-line, this style
does not map well onto the expressive capabilities of C. A mix of
writes and ioctls ended up looking better on the page and is easier to
write.

In general, ddsetup uses write calls for variable length data and ioctls
for fixed length structures. One could also say: writes for data and
ioctl commands for, ahem, commands.

There is a little state machine inside a ddsetup fd that keeps track of
where you are in midst of a complex call sequence, particularly the
device create sequence. Using ddsetup, you push strings onto a stack
with write calls and turn the strings into more complex objects using
ioctls. If you make a mistake and get a -1 error return from any
of the calls, you can then read from the ddlink to get a text
description of what went wrong, complete with message formatting
courtesy of the ddlink_error convenience function.

For example, given a ddsetup fd named dd:

ioctl(dd, DMTABLE, &(struct ddtable){ .targets = 1 });
write(dd, "linear", 6);
write(dd, "/dev/hda5", 9);
write(dd, "1234", 4);
ioctl(dd, DMTARGET, &(struct ddtarget){ .sectors = 10000 });
write(dd, "foo", 3);
ioctl(dd, DMCREATE);
read(dd, &result, sizeof(result));

Leaving out error handling for clarity, this creates a 10,000 sector
virtual device named "foo" which is a linear mapping of hda5 starting
at sector offset 1234, equivalent to the shell command:

echo 0 10000 linear /dev/ubdb 1234 | dmsetup create foo

Except that we did not use the shell, or a library, just a header file
to name the ioctl commands and provide some simple interface structs.
(Actually, the example above is more complex than necessary. I do not
think the .targets field serves any useful purpose, and I will make it
go away soon.)

Reflecting device mapper's table structured arrangement, the sequence
from the "linear" write to the DMTARGET ioctl may be repeated
arbitrarily many times to build up a complex mapping. In fact, this is
how device mapper maps extents from an lvm partition to your
lvm "partitions". Easy, no? Well it is when written as above.

Such mappings are not restricted to linear targets. Some fancy mappings
have linear targets at each end and a temporary mirror of two devices
in the middle. This is how lvm2 implements pvmove, its clever ability
to relocate physical targets of a virtual device while the device is
running. Powerful, and practically unknown to most Linux users.
According to me, that is because it is hard to write programs to drive
such functionality. As a result, when Linux users do it, they do it by
hand. Solaris users are having a party with this kind of thing, and
laughing at us. Really.

There is not a lot more to say about ddsetup, which is actually the
point. It is pretty obvious how to use it, and how it is implemented.
I did have to do some pretty serious spelunking in dm-ioctl.c to ferret
out the bits of device mapper that do the actual work, in some cases
having to go pretty deep to work past dependencies on the monolithic
ioctl interface struct. Some header files needed to be rearranged,
arguably into the form they should have taken in the first place.
There are some needlessly strange object lifetime rules to deal with
internally, but otherwise this was a pretty straightforward romp.

An early version of this code was shown to Eric Biederman and Alasdair
Kergon at OLS last year. After taking all the C99 bits out, I managed
to convince Eric and others to actually read the examples. Let us now
see if the (positive) reaction I observed at that time survives wider
scrutiny.

A diffstat for ddlink and ddsetup together:

Documentation/ioctl-number.txt | 1
block/ll_rw_blk.c | 2
drivers/Makefile | 1
drivers/ddlink.c | 294 ++++++++++++++++++++
drivers/md/Makefile | 1
drivers/md/dm-ioctl.c | 593 ++++++++++++++++++++++++++++++++++++++---
drivers/md/dm-table.c | 52 ---
drivers/md/dm.c | 53 ---
drivers/md/dm.h | 78 +++++
include/linux/ddlink.h | 41 ++
include/linux/ddsetup.h | 36 ++
include/linux/device-mapper.h | 37 ++
12 files changed, 1056 insertions(+), 133 deletions(-)

Currently, this implements a majority of the device mapper interface
calls, but not all of them, so expect another hundred or two lines
before completion. This is still significantly less code than the
original ioctl interface (which is still in there) and much clearer.

I have written two example programs, ddsetup.c and ddcreate.c. The
former aims to be a drop-in replacement for dmsetup.c and the latter
implements a (useful) demonstration command that creates a virtual
device consisting of a single device mapper target, with all target
parameters supplied on the command line.

For example:

ddcreate foo 10000 linear /dev/ubdb 1234

ddcreate is 71 lines long including plenty of whitespace, while being
quite general. My message is about the 71 lines.

So who uses this ddsetup today? Answer: nobody. Better answer: the
ddsnap cluster snapshot driver has a usability problem because of its
reliance on PF_LOCAL sockets to glue components together. Filesystem
based sockets were adopted for the component glue because it is hard to
do anything more elegant working with the command line device mapper
setup utility. We looked at hacking the device mapper ioctl interface
to do what we needed, but then if we were willing to go that far then
why not just drop the other shoe and improve the userspace interface to
the point where it is actually pleasant to use, and maintainable too?
This is how ddsetup was born.

The next thing we need to do with this interface is demonstrate a solid
use case by adopting it on an experimental branch of zumastor. I
expect both ddsnap and zumastor systems to shrink as a result,
including significantly shrinking the documentation. This has not yet
been done yet, and until it is, this effort deserves to be firmly
relegated to the "nice but so what" category. So, profound thanks to
you, dear reader, for having had the stamina to read all the way to
here, and we will see you here again after having eaten this delicious
new dogfood ourselves.

http://code.google.com/p/zumastor/source/browse/www/ddsetup/ddsetup-2.6.23.12
http://code.google.com/p/zumastor/source/browse/www/ddsetup/ddsetup.c
http://code.google.com/p/zumastor/source/browse/www/ddsetup/ddcreate.c

[ARND 07] How to not invent kernel interfaces, Arnd Bergmann,
arnd@xxxxxxxx, July 31, 2007
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/