A file system like Digital's AdvFS

Larry Butler (larrybutler@earthlink.net)
Tue, 29 Jun 1999 19:22:34 -0600


Hi,

I have been thinking about writing something similar to Digital's Advanced
File System, except without the journaling feature. For anyone not familiar
with AdvFS, it really is more than just a file system.

With AdvFS, one or more block devices are grouped together into a "file
domain." Several file domains can exist on one system. A file domain is
just a logical pool of storage space. I say "pool" because there is no
user-visible structure imposed on the file domain. It is just a place from
which i-nodes and data blocks can be allocated. Each file domain can have
several file sets in it. A file set is just a file system. So for example
you could have a file domain called "tmp" with a single 100 MB device. This
file domain could have two file sets called "fs1" and "fs2" in it. And your
fstab might have entries like:

tmp#fs1 /tmp AdvFS
tmp#fs2 /var/tmp AdvFS

All the available space in the "tmp" file domain is available to either file
system.

The really neat feature is that the underlying storage can be rearranged while
the file systems are in use. For example you could add a 500 MB device to
the "tmp" file domain and suddenly both file systems have more available
space. You can then remove the first device as long as there is still
enough remaining space to hold all the files in all the file sets.

I think I would just call a "file domain" a storage pool and a "file set" a
file system.

Below are some thoughts on the implementation of this "storage pool file
system." I would be interested to hear what you think about it.

Physical Layout

Each physical device in a storage pool has a number of i-nodes and data
blocks that it contributes to that pool. The physical layout of the data on
each device can be the same as a traditional unix file system with some
i-nodes, data blocks and bitmaps arranged in groups.

The differences would be:

1) The block pointers in the i-nodes and indirect blocks have to address
blocks that are potentially located anywhere in the pool. That is, a
block is not necessarily on the same physical device as the pointer to
it.

2) The i-node numbers in the directories have to be able to name any
i-node in the pool (similar to #1).

The first problem is to decide how to address the i-nodes and data blocks.

Simply assigning n bits for the device number and m bits for the block
number is probably a waste of space because both m and n would have to be
large enough for any user's needs. This would make the i-nodes and
directories unnecessarily large and make things unnecessarily slow.

I am thinking about a scheme where each device has a section of the "block
number space" and the "i-node number space" that is large enough to contain
all the blocks and i-nodes on that device, but not too much larger than it
needs to be. To make look-ups very fast, a device's part of the block and
i-node spaces would be characterized by unique values of the most
significant bits in the addresses of the data blocks and i-nodes on that
device, but the length of that prefix would be variable. This is very similar to
the way that IP addresses are assigned in a sub-net.

Let's look at an example with data blocks.

Say you had a pool with two devices. The first has 2GB of data blocks and
the second has 9GB of data blocks. If they are added to the pool in that
order you would see something like the following.

Assume 32 bit block numbers and 1Kb blocks for simplicity.

2GB = 2 * 2^30 bytes = 2 * 2^20 blocks = 1 * 2^21 blocks
9GB = 9 * 2^30 bytes = 9 * 2^20 blocks = 9/16 * 2^24 blocks

So the 2GB disk requires 21 bits and the 9GB disk requires 24 bits to
differentiate its blocks.

The 11-bit prefix 00000000 000 would be assigned to the first device and the
8-bit prefix 00000001 to the second.

So blocks on the first device would have addresses that looked like this:

00000000 000xxxxx xxxxxxxx xxxxxxxx

And blocks on the second device would have addresses that looked like this:

00000001 xxxxxxxx xxxxxxxx xxxxxxxx

So the entire block number space would be divided up as follows:

1. 00000000 00000000 00000000 00000000 - 00000000 00011111 11111111 11111111
2. 00000000 00100000 00000000 00000000 - 00000000 11111111 11111111 11111111
3. 00000001 00000000 00000000 00000000 - 00000001 10001111 11111111 11111111
4. 00000001 10010000 00000000 00000000 - 00000001 11111111 11111111 11111111
5. 00000010 00000000 00000000 00000000 - 11111111 11111111 11111111 11111111

Region #1 would be for the 2GB device.
Region #2 would be unused and available.
Region #3 would be for the 9GB device.
Region #4 would be unused and not available (it falls under the 9GB device's
prefix but is beyond that device's actual blocks).
Region #5 would be unused and available.
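
To make the prefix idea concrete, here is a rough sketch in C of how a
pool-wide block number might be resolved to a device, using the 32-bit
block numbers from the example. All of the names here are made up; the
point is just that the lookup is a simple mask-and-compare, like routing
an IP address to a sub-net.

#include <linux/types.h>

struct bp_device {
        __u32 blk_prefix;       /* e.g. 0x00000000 and 0x01000000 above */
        int   blk_prefix_bits;  /* e.g. 11 and 8 in the example above   */
        __u32 num_blocks;       /* data blocks actually on the device   */
        /* ... block device, bitmaps, i-node prefix, etc. ...           */
};

/*
 * Map a pool-wide block number to the device that owns it.  Prefixes
 * never overlap, so the first match is the only possible match.
 * Returns NULL for block numbers in unused regions (like regions #2,
 * #4 and #5 above).  Assumes 1 <= blk_prefix_bits <= 31.
 */
struct bp_device *bp_blk_to_dev(struct bp_device *dev, int ndevs, __u32 blkno)
{
        int i;

        for (i = 0; i < ndevs; i++) {
                __u32 mask = ~0U << (32 - dev[i].blk_prefix_bits);

                if ((blkno & mask) == dev[i].blk_prefix &&
                    (blkno & ~mask) < dev[i].num_blocks)
                        return &dev[i];
        }
        return NULL;
}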

The "i-node number space" would be divided up in a similar way, but the part
of the "block number space" used by a device is not related to the part of
the "i-node number space" it uses. The proportion of i-nodes to data blocks
can be very different on different devices. In fact, I see no reason why a
device couldn't contain only data blocks or only i-nodes.

The second problem is that user-visible i-node numbers cannot change when an
i-node is migrated from one physical device to another.

To solve this there will be a table of i-node numbers for each file system
in the storage pool. The entries in this table will function as logical
i-nodes for that file system. Although not visible to the user, this table
will be stored as a normal file so that it can be easily relocated just as a
normal file can be.

There would also have to be a table of file systems stored in the same way.

The super block would probably contain the real i-node number of the table of
file systems. Each entry in the table of file systems would contain the
real i-node number of the logical i-node table for that file system. The
root i-node of a file system would be the first i-node in the logical i-node
table for that file system. Directories would contain an index into the
logical i-node table and this index would be the user-visible i-node number.
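
Just to illustrate the lookup path (all names and sizes here are
provisional), the tables might look something like this:

/* One entry per file system (file set) in the pool. */
struct bp_fs_entry {
        __u64 li_table_ino;     /* real i-node number of this file      */
                                /* system's logical i-node table        */
        char  name[32];         /* e.g. "fs1", "fs2"                    */
};

/* One entry per user-visible i-node in a file system. */
struct bp_li_entry {
        __u64 real_ino;         /* current physical i-node number;      */
                                /* rewritten when the i-node migrates   */
};

/*
 * Looking up user-visible i-node number N in file system F:
 *   1. the super block gives the real i-node number of the table of
 *      file systems
 *   2. entry F of that table gives li_table_ino
 *   3. entry N of the logical i-node table file gives real_ino
 *   4. real_ino locates the physical i-node via its prefix
 * The root directory of F is simply the first entry of its logical
 * i-node table.
 */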

When a device needs to be removed from a storage pool all data blocks and
i-nodes on that device will have to be relocated. When moving an i-node you
need to know what entry in the logical i-node table is referencing it.
Similarly, when moving a data or indirect block you need to know what i-node
or indirect block is referencing it.

Although I am not quite sure, it seems reasonable to have the i-nodes know
from what entry in a logical i-node table they are referenced, so that when
the i-node is moved it is possible to directly locate the reference that
needs to be changed. This would require two extra fields in the i-node.
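
Something like this is what I mean by the two extra fields (the names
are just placeholders):

struct bp_disk_inode {
        /* ... the usual mode, size, times, block pointers ... */
        __u32 fs_index;         /* which file system's logical i-node   */
                                /* table references this i-node         */
        __u32 li_index;         /* index of the referencing entry in    */
                                /* that logical i-node table            */
};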

It does not seem reasonable, however, to have each data block know from
which i-node or indirect block it is referenced. This would require at
least a full pointer per block where today a single allocation bit suffices.

So, when a device needs to be removed it will have to be marked for removal and
then each i-node will be visited to determine if (1) any data blocks or
indirect blocks need to be relocated and (2) if the i-node itself should be
relocated.

In pseudo-code it would be:

mark device(s) for removal
for each device D in the pool
    for each i-node I on device D
        for each data and indirect block B referenced by I
            if B is on a device marked for removal
                allocate a new block B'
                copy B to B'
                update the pointer in I (or in the indirect block) to B'
                deallocate B
            end if
        end for each B
        if D is marked for removal then
            allocate a new i-node I'
            copy I to I'
            update the logical i-node table to point to I'
            deallocate I
        end if
    end for each I
end for each D

Each device would have its own copies of the super block. Keep in mind that
any data that is not part of the super block (and thus not copied on to
every physical device) must be movable.

The superblock would contain at least the following:

magic number
Identifies a storage pool member device

super block layout version
The version number of the superblock data structure

pool id
A number that identifies the pool; this would be a system-generated
number that can be used to determine what set of devices make up a
storage pool.

super block data version
This number would be incremented each time a change is made to the
logical, in-memory superblock and it is written back to the disk.
This would allow detection of out-of-sync superblocks on different
devices when we are registering a storage pool.

number of devices
The number of physical devices in this pool.

device status flags
We probably have to mark a device as "in the process of removal" when
we're migrating blocks and i-nodes off of it. There are probably
other uses for this field too.

block size bits
The exponent for the block size (10 = 1Kb).

number of data blocks
The number of data blocks on the device.

block number prefix
The block number prefix for all the data blocks on this device

block number bits
The number of bits in the block number prefix

i-node number prefix
The i-node number prefix for all the i-nodes on this device

i-node number bits
The number of bits in the i-node number prefix

"table of file systems" i-node number

Interface

There has to be some kind of interface to perform operations on a block
pool. Some of these operations are:

create a new storage pool
add a device to a storage pool
remove a device from a storage pool
create a new file system in a storage pool
remove a file system from a pool

There are a couple of design goals or principles that I want to follow:

1) Any operation performed on the pool should be done in the context of the
process that is performing the operation. That is, without a kernel
thread.

2) I don't want to add a new system call.

3) I don't want the kernel to have to "know about" pools that aren't in
use. This would unnecessarily constrain user-space operations on the
storage pools and could lead to situations where the kernel is out of
sync with the actual data on the disks.

The kernel only needs to know about a storage pool when one of its file
systems is mounted or a user process is manipulating the pool in such a
way that would require the kernel to be involved.

4) I don't want to put a device in /dev for each storage pool or list all
the pools in /proc because the kernel doesn't know about pools that are
not currently in use. I would prefer to handle this outside of the kernel.

We will do the "create" operation in user-space (like mkfs) since we know
nothing else could use the pool until it is created. Also the "delete"
operation implied by over-writing one of the devices is done from user
space.

All the other operations will have to be done with help from the kernel.
Even if no file system is mounted, you might want to mount one before the
operation is complete.

The idea I have settled on revolves around a single character device (maybe
called /dev/bpctrl). When a user process wants to perform an operation on a
storage pool it would open /dev/bpctrl and call an ioctl with a pointer to a
data structure containing the device numbers of all the devices in the pool.
The kernel would look to see if it already knows about that pool.

If the kernel does not know about the pool it would read the superblock from
all of the devices, check that they exactly specify a single storage pool, and
then create an in-memory data structure to describe the pool.

If the kernel already knows about the pool whose devices were given, it
would simply use the structure it already has.

In any case the "struct file" would be made to reference that storage pool's
data structure and the usage count of the data structure would be increased.

The file descriptor obtained by the user process would now be a "handle" for
that storage pool. Other ioctls could be used to perform operations on the
pool. Any I/O required by those operations would be done by that process.

The storage pool data structure's usage count would reflect the number of
handles referring to it plus the number of mounted file systems contained in
it. When that usage count dropped to zero the kernel would "forget" that
that storage pool exists.
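
To show what I mean, here is roughly what a user-space tool would do to
get a handle on a pool. The ioctl number, the structure and the device
array size are all invented for the example:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

struct bp_pool_spec {
        int   ndevs;
        dev_t devs[16];         /* device numbers of all pool members */
};

#define BP_IOC_ATTACH _IOW('b', 1, struct bp_pool_spec) /* made-up number */

/* Returns an fd that acts as a handle for the pool, or -1 on error. */
int bp_open_pool(const struct bp_pool_spec *spec)
{
        int fd = open("/dev/bpctrl", O_RDWR);

        if (fd < 0)
                return -1;
        if (ioctl(fd, BP_IOC_ATTACH, spec) < 0) {
                close(fd);
                return -1;
        }
        return fd;      /* further ioctls on fd operate on this pool */
}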

By specifying the devices that make up the pool when referring to it, the need
to explicitly register and unregister them is eliminated. This allows for
a greater level of independence among multiple processes operating on a
single storage pool while still allowing the kernel to "forget" about block
pools that are no longer in use.

There would be a library layer on top of the kernel interface that would
maintain a table in /etc that would list all the known storage pools, the
devices that make them up and the file systems in them. This library would
provide symbolic names for storage pools and file systems.

Some of the functions would be:

int bp_new_pool (const char *bp, const char *device);
This would set up the data structures on the device for a new block
pool with one device and no file systems.

int bp_new_fs (const char *bp, const char *fs);
This would create a new file system in an existing pool.

int bp_add_device (const char *bp, const char *device);
This would add a device to an existing pool.

int bp_remove_device (const char *bp, const char *device);
This would remove a device from an existing pool. This could take a
long time; maybe we will want a callback function that could report
the progress.

int bp_mount (const char *bp, const char *fs, const char *mount_point);
This would mount an existing file system in an existing storage pool
on the given mount point.

I don't know if it should get a handle for the pool and then pass
that to mount(2) or just pass the list of physical devices directly.
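
As a usage example, setting up the "tmp" pool from the beginning of this
message might go something like this (error handling is minimal and the
device names are only examples):

int build_tmp_pool(void)
{
        if (bp_new_pool("tmp", "/dev/sdb1") < 0)        /* 100 MB device */
                return -1;
        if (bp_new_fs("tmp", "fs1") < 0 || bp_new_fs("tmp", "fs2") < 0)
                return -1;
        if (bp_add_device("tmp", "/dev/sdc1") < 0)      /* grow the pool */
                return -1;
        if (bp_mount("tmp", "fs1", "/tmp") < 0)
                return -1;
        return bp_mount("tmp", "fs2", "/var/tmp");
}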

I'm thinking about 5 byte data block numbers. With 1KB blocks that would
give 2^50 bytes (1 petabyte) - 1KB of file storage and 204 pointers on an
indirect block. I think the block number space needs to be at least twice
the maximum amount of storage that would actually be useful; otherwise there
might not be a free prefix left when adding a replacement device, and the
data-movement features could not be used as easily. How many direct,
indirect, doubly indirect, etc pointers should there be in the i-node?

Kernel Level Implementation

The implementation should be straightforward except for making it possible
to move in-use disk blocks. I suspect this is possible because of the
universal use of the buffer cache to access block devices.

This is where I need some advice. The VFS can ask the file system to
provide the physical location of a part of a file. The VFS can then access
those block devices directly.

I think it should be possible to

1) put call back functions in the VFS to say "If you are using device n,
block m, use device x, block y instead" (a rough sketch of what such a
hook might look like follows after this list).

and/or

2) re-write the buffer head to look like it came from a different device and
then re-file it? The VFS would probably still need to know when this
happens. Also, I don't think the buffer cache was designed to have those
fields changed on the fly. But it would be nice to avoid the extra I/O
to write out the old block and read in the new block when a block is
moved. BTW, is there a function like bread() that gives you an already
dirty buffer with undefined contents? That way you don't have to read in
data that you are just going to overwrite.
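
For (1), the hook I am imagining is something along these lines. Nothing
like this exists in the VFS today, so treat it purely as a sketch:

#include <linux/fs.h>

struct bp_blk_remap {
        kdev_t        old_dev;
        unsigned long old_block;
        kdev_t        new_dev;
        unsigned long new_block;
};

/*
 * Called by the file system while it is migrating a block.  The VFS /
 * buffer cache would redirect any users of (old_dev, old_block) to
 * (new_dev, new_block) before the old block is freed.
 */
int vfs_remap_block(struct super_block *sb, struct bp_blk_remap *map);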

Any suggestions would be greatly appreciated.

-Larry
