File conglomerations

Iain McClatchie (iainmcc@ix.netcom.com)
Wed, 30 Jun 1999 16:09:36 -0700


I see a number of issues with current file systems. I have four
independent suggestions which I think can work together to help with
these issues. To see my suggestions, skip to the --------.

Files in Linux currently have two components. One is the file data
itself, accessed through the open/read/write/close interface. The
other component is the file's many attributes, which are read through
stat() and written through utime(), chmod(), and chown(). The attributes
are also consulted and changed as a side effect of many file operations.
Different file systems have slightly different attributes, which
require odd extensions to the Linux API to read and write. It would
facilitate interoperability to be able to move files from one FS to
another, and back, and retain the attributes unsupported by the
second FS. It would facilitate FS development to have an attribute API
which did not require changes in order to add new attributes.
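
To make the current attribute interface concrete, here is a minimal sketch
of carrying two kinds of attribute from one file to another with the
existing calls. The helper name copy_basic_attrs is hypothetical, and the
sketch deliberately skips ownership, since chown() usually needs privilege.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <utime.h>

/* Hypothetical helper: carry the permission bits and timestamps of src
   over to dst. Returns 0 on success, -1 on error (errno is set). */
int copy_basic_attrs(const char *src, const char *dst)
{
    struct stat st;
    struct utimbuf times;

    assert(src != NULL && dst != NULL);
    if (stat(src, &st) != 0)                   /* read all attributes at once */
        return -1;
    if (chmod(dst, st.st_mode & 07777) != 0)   /* write permission bits only */
        return -1;
    times.actime = st.st_atime;                /* access time */
    times.modtime = st.st_mtime;               /* modification time */
    return utime(dst, &times);                 /* a separate call per attribute kind */
}
```

Any attribute that struct stat does not know about is silently lost by a
copy like this, which is the interoperability problem described above.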

Many programs would like to store lots of configuration data with
directories and files. Examples are icons, links to the creating
program, and arrangement positions on a GUI. These programs do not
generally store this information as files inside directories (which
are themselves compound documents), but instead glob it together into
a single file, because
1. The FS will burn a minimum of 1K or 4K storing a very small
file, making for inefficient use of disk space.
2. Each tiny file must be opened separately, whereas one big glob
requires many fewer trips through the O/S and FS code.
3. The user doesn't expect to see a directory, but rather a file.
4. The program makes assumptions about the structure of the data
in one of its files, and guarantees that none of its operations
to the data will violate those assumptions. A novice user whose
shell or GUI showed the internals of the compound document might
make changes which would violate those assumptions.
I think the 3rd problem is a consequence of the first two, and the 4th
problem might be solved with something like capabilities or access
control lists... something that requires the user to assert that he
knows what he's doing before he modifies the contents of directories
with referential integrity issues.
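
The disk-space cost in point 1 can be measured rather than guessed. The
following is a minimal sketch, assuming Linux, where st_blocks is counted
in 512-byte units; the function names file_footprint and report_footprint
are hypothetical. Some filesystems report zero allocated bytes for a very
small or not-yet-flushed file, so treat the numbers as indicative.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Fills *logical with the file's length in bytes and *allocated with the
   bytes the filesystem reports as allocated for it.
   Returns 0 on success, -1 on error (errno is set). */
int file_footprint(const char *path, long long *logical, long long *allocated)
{
    struct stat st;

    assert(path != NULL && logical != NULL && allocated != NULL);
    if (stat(path, &st) != 0)
        return -1;
    *logical = (long long)st.st_size;
    *allocated = (long long)st.st_blocks * 512;  /* assumes 512-byte units (Linux) */
    return 0;
}

/* Prints both numbers for one file, or the error if stat() failed. */
void report_footprint(const char *path)
{
    long long logical, allocated;

    if (file_footprint(path, &logical, &allocated) == 0)
        printf("%s: %lld bytes of data, %lld bytes allocated\n",
               path, logical, allocated);
    else
        perror(path);
}
```

Running report_footprint() over a tree of tiny configuration files shows
how far the allocated total exceeds the data total on a given filesystem.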

All the same, the filesystem-in-a-file idea, as implemented in MacOS
and by MS Word, Excel, and so on, has some drawbacks which a compound
document-in-a-directory might not have:
1. An exposed compound document might serve as a better framework
for many otherwise unrelated programs to edit the various bits
of a single document.
2. Two layers of filesystem lead to two layers of index trees, two
layers of block management, more code development, potentially
slower operation, and so on. (MacOS stores both the data and
resource fork with a single storage manager, but IIRC it does
the FS-in-a-file thing for the resource fork.)

------------
Now for my suggested improvements to take on these issues. Some of
these improvements are from other people; I'm just organizing them
here to show that they play well together.

The ideas of file and directory can be unified.
1. Every directory can be opened as a file. For compatibility with
other O/Ses, existing programs, and network file access, the file
data of a directory can also be named by a special file name
within the directory, like ".default". The directory-as-file is
not any special filter applied to the directory contents, as in
Hans' current suggestion, but a simple byte stream just like file
contents today. This is essentially the suggestion that Linus has
made.
2. Every file can be opened as a directory. The file-as-directory
contains as subfiles its attributes. I believe this is a new
suggestion, and I think it neatly solves the attribute name space
API problem, at least as far as interoperability with other
filesystems goes. For compatibility with other O/Ses, existing
programs, and network file access, the file attributes are still
read/writeable by stat(), chmod() and friends, and the type is
returned as "directory" if other files have been added to the
directory, and returned as "file" if not. I haven't yet seen
anyone else suggest this idea, but it probably has come up before.

If all we do is unify files and directories, we get a low-performance
but consistent API to attributes, which allows files to be moved between
filesystems while preserving (though not updating or respecting)
unsupported attributes. We also need a fast API to read and write many
very small files. I suggested this idea in a different thread, but the
basic notion is

read_dir_files( const char *dirname, int maxfiles, size_t bufsize,
const char **filename, char **fileval, char *buf )

Here the system call gives the O/S a list of files to be read, a
buffer into which to concatenate the contents of all the files, and
a pointer array into which the O/S deposits a pointer to the first
byte of each file value. This interface is bulkier than stat() for
doing the same thing, but it's completely extensible, and it also
can be used for things like access control lists, icons, links to
file creators, and the like. Applications can use the interface to
achieve faster execution when their host O/S supports it, and fall back
to open()/read()/close() and stat() when it doesn't.
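
The fallback path can be sketched in user space today. This is a minimal
sketch, not the proposed system call: the name read_dir_files_fallback is
hypothetical, and three details are my assumptions because the prototype
above does not specify them. Each value is NUL-terminated inside buf, a
file that cannot be opened leaves fileval[i] set to NULL, and the return
value is the number of files read, or -1 if buf is too small.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* User-space stand-in for the proposed read_dir_files() call, built from
   open()/read()/close(). See the assumptions listed in the text above. */
int read_dir_files_fallback(const char *dirname, int maxfiles, size_t bufsize,
                            const char **filename, char **fileval, char *buf)
{
    size_t used = 0;    /* bytes of buf consumed so far */
    int nread = 0;      /* files successfully read */
    int i;

    assert(dirname != NULL && filename != NULL && fileval != NULL && buf != NULL);

    for (i = 0; i < maxfiles; i++)
        fileval[i] = NULL;

    for (i = 0; i < maxfiles; i++) {
        char path[4096];
        char extra;     /* scratch byte used to detect data beyond a full buf */
        ssize_t n;
        int len, fd;

        len = snprintf(path, sizeof path, "%s/%s", dirname, filename[i]);
        if (len < 0 || (size_t)len >= sizeof path)
            continue;                       /* name too long: treat as unreadable */
        fd = open(path, O_RDONLY);
        if (fd < 0)
            continue;                       /* missing file: leave NULL */
        if (used >= bufsize) {              /* no room even for the NUL */
            close(fd);
            return -1;
        }
        fileval[i] = buf + used;
        for (;;) {
            size_t room = bufsize - used - 1;   /* keep one byte for the NUL */
            n = read(fd, room ? buf + used : &extra, room ? room : 1);
            if (n <= 0)
                break;                      /* end of file, or a read error */
            if (room == 0) {                /* more data, but buf is full */
                close(fd);
                fileval[i] = NULL;
                return -1;
            }
            used += (size_t)n;
        }
        close(fd);
        buf[used++] = '\0';
        nread++;
    }
    return nread;
}
```

This version still makes one trip through open()/read()/close() per file,
so it gives the interface without the speed; the point of the real call
would be to do the same work in a single trip through the O/S.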

This API extension together with the unification above gives us a fast
API to attributes and tiny files, but those tiny files will still be
somewhat inefficient in disk space and access speed. That's fine for
when cool new GUIs and compound document creation systems are used on
older filesystems (say, across a network), but we'd like to have a
filesystem that implements tiny files well. The most important
requirement here is that the file data be laid out in such a way that
one disk access is generally enough to pull a directory and all its
small subfiles into memory. Maybe reiserfs will do the job; maybe not.
I suspect it will be challenging to efficiently store and retrieve
4-byte "files" as well as inherited data
(directory/filename/.groupid/.accessbits _is_ directory/.accessbits)
and group data (directory/.accesstime _is_ the latest of
directory/*/.accesstime).
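
To show what the group-data rule costs when the filesystem does not
maintain it, here is a minimal sketch that computes "the latest access
time of anything directly inside a directory" with today's calls. The
function name latest_child_atime is hypothetical, and it looks only one
level deep, using st_atime from stat() in place of a .accesstime subfile.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>

/* Returns the latest access time among the direct entries of dirname,
   or (time_t)-1 if the directory cannot be read or has no entries. */
time_t latest_child_atime(const char *dirname)
{
    DIR *dir;
    struct dirent *ent;
    time_t latest = (time_t)-1;

    assert(dirname != NULL);
    dir = opendir(dirname);
    if (dir == NULL)
        return (time_t)-1;

    while ((ent = readdir(dir)) != NULL) {
        char path[4096];
        struct stat st;
        int len;

        /* Skip the directory itself and its parent. */
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        len = snprintf(path, sizeof path, "%s/%s", dirname, ent->d_name);
        if (len < 0 || (size_t)len >= sizeof path)
            continue;                   /* name too long: skip it */
        if (stat(path, &st) != 0)
            continue;                   /* entry vanished or is unreadable */
        if (latest == (time_t)-1 || st.st_atime > latest)
            latest = st.st_atime;
    }
    closedir(dir);
    return latest;
}
```

Every query costs one stat() per entry, which is why a filesystem that
wants to expose such group data cheaply has to store or cache it itself.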

There is one other reason programmers write filesystems-in-a-file: they
wish to avoid the expense of rewriting the entire file when they are
adding data to the middle. The amoeba FS is at least one FS that allows
insertion into the middle of a file rather than overwriting.
Applications pretty much have to commit to this interface, so I'm not
surprised I haven't read about programming experience with this API and
whether it really does eliminate for a user-level fragmentation and
block manager.
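
For comparison, this is roughly what insertion costs with the ordinary
byte-stream interface, which offers only overwrite and append. It is a
minimal sketch using standard C stdio; the function name insert_bytes is
hypothetical, and it holds the whole tail in memory for simplicity.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Inserts len bytes at byte offset off in the file at path by rewriting
   everything after off. Returns the number of existing bytes that had to
   be rewritten, or -1 on error (including off beyond end of file). */
long insert_bytes(const char *path, long off, const char *data, size_t len)
{
    FILE *f;
    long size = 0, tail = 0;
    char *saved = NULL;

    assert(path != NULL && data != NULL && off >= 0);
    f = fopen(path, "r+b");
    if (f == NULL)
        return -1;

    /* Find the file size so we know how much follows the insertion point. */
    if (fseek(f, 0, SEEK_END) != 0 || (size = ftell(f)) < 0 || off > size)
        goto fail;
    tail = size - off;

    /* Save the tail, since writing at off would otherwise overwrite it. */
    if (tail > 0) {
        saved = malloc((size_t)tail);
        if (saved == NULL || fseek(f, off, SEEK_SET) != 0 ||
            fread(saved, 1, (size_t)tail, f) != (size_t)tail)
            goto fail;
    }

    /* Write the new data, then put the saved tail back after it. */
    if (fseek(f, off, SEEK_SET) != 0 ||
        fwrite(data, 1, len, f) != len ||
        (tail > 0 && fwrite(saved, 1, (size_t)tail, f) != (size_t)tail))
        goto fail;

    free(saved);
    return fclose(f) == 0 ? tail : -1;

fail:
    free(saved);
    fclose(f);
    return -1;
}
```

The return value is the point: inserting a few bytes near the start of a
large file rewrites nearly all of it, which is the expense that pushes
programmers toward their own block management inside one file.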

Oh, there is one other change I'd like to see, which we could put in
independently of the rest: that mv allow me to move a directory
across filesystems. Is there something semantically imprecise
about this idea?
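
Some background on why this needs special handling: rename() fails with
EXDEV when the old and new names are on different filesystems, so a mover
has to copy and then delete instead. A minimal sketch of the check such a
tool might make first follows. The function name same_filesystem is
hypothetical, and comparing st_dev is only an approximation, since
rename() can still report EXDEV in some setups where the device numbers
match.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Returns 1 if both paths report the same device, 0 if they differ,
   and -1 if either path cannot be examined (errno is set). */
int same_filesystem(const char *a, const char *b)
{
    struct stat sa, sb;

    assert(a != NULL && b != NULL);
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev;   /* different devices: rename() cannot work */
}
```

When this returns 0 for a directory and its destination, the move has to
become a recursive copy followed by removal of the original tree.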

Whew. Okay, I'm done now. What do you think?

-Iain
iainmcc@ix.netcom.com

P.S. What IBM O/S did you work on?
