Re: undelete?

Christian von Roques (roques@pond.sub.org)
27 Jul 1997 13:59:51 +0000


The tall cool one <ice@mama.indstate.edu> writes:
> Darin Johnson <darin@connectnet.com> writes:
> > > Compression should be left to user space apps, which can be written
> > > to deal with the problem in a more sane manner.
> > Except then that every app needs to understand compressed files, and
> > currently, they don't. [...]
> And when every app does understand compression, you won't have to worry
> about which FS you're trying to do compression on, [...]

Arguing about whether user-transparent support for compressed files
has to be done in user space or in kernel space slightly misses the
point. It should not be an integral part of every user program,
although that would unarguably be the most flexible solution. It can
and should be factored out of the user's programs, like TCP/IP,
windowing- and file-systems. I think we can agree that it has to live
between the user's program [shared libraries are not a part of it]
and the lower parts of the file-system doing block-based IO.

My opinion is that implementing transparent compression as a virtual
file-system layer on top of other file-systems is the way to go. But
the ideal underlying file-system probably isn't there yet; it will
have to provide some features that are not yet available, like
storage of more than just a fixed set of attributes, and multiple
flexible data portions per file [e.g. compressed and uncompressed
contents, usage statistics maintained and updated by a daemon or the
kernel, several old versions to allow sending patch(1)-style updates
to currently disconnected peers, ...]. It should be possible to
simulate these features by mapping one user-level file to several
low-level files, but that would be inefficient in both space and time.
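
To make that a bit more concrete, here is a rough user-space sketch
[plain C; the structures, field names and the `raw'/`gzip' portion
names are invented for illustration only, no existing file-system
offers such an interface] of per-file metadata carrying several data
portions, and of how a read would pick one:

  /* Purely hypothetical per-file metadata for a file-system storing
   * several named data portions per file; nothing like this exists
   * in ext2 today. */
  #include <stdio.h>
  #include <string.h>
  #include <sys/types.h>

  struct portion {              /* one data portion of a file   */
      char  name[16];           /* "raw", "gzip", "stats", ...  */
      off_t start;              /* first block of this portion  */
      off_t length;             /* its length in bytes          */
  };

  struct multi_inode {          /* per-file metadata            */
      unsigned       nportions;
      struct portion p[8];
  };

  /* A reader should see the uncompressed contents while they are
   * still on disk, and the compressed contents otherwise. */
  static const struct portion *
  portion_for_read(const struct multi_inode *mi)
  {
      unsigned i;

      for (i = 0; i < mi->nportions; i++)
          if (strcmp(mi->p[i].name, "raw") == 0)
              return &mi->p[i];
      for (i = 0; i < mi->nportions; i++)
          if (strcmp(mi->p[i].name, "gzip") == 0)
              return &mi->p[i];
      return NULL;
  }

  int main(void)
  {
      struct multi_inode mi = { 2, { { "gzip", 0, 1234 },
                                     { "stats", 8, 64 } } };
      const struct portion *p = portion_for_read(&mi);

      printf("reading from portion `%s'\n", p ? p->name : "none");
      return 0;
  }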

I've heard that the Macintosh's HFS [and the `new NT FS'?] supports
several `data-forks' per file and would like to hear a short summary
of its design and API, as well as its biggest [dis-]advantages in
practice. Maybe we can learn from their experience.

> Miquel van Smoorenburg <miquels@cistron.nl> writes:
> > [...] You just divide each file into chunks of (say) 4Kb. You then
> > compress that 4Kb block. If it becomes 1Kb, great, you put it in the
> > first block of 4, and the other 3 blocks are empty. Since empty
> > blocks do not take up disk space (holes), that's no problem.
> >
> > So you don't compress the file as a whole, you compress it on a 4Kb or
> > 8Kb block basis.

Splitting files into fixed-size blocks and compressing them
independently does solve the problem of extremely slow seek(2)s on
huge files, and is the way the XPK project for the Amiga went. My
experience with XPK and Unices indicates that most files are accessed
sequentially. [This is not true for databases, executables and shared
libraries randomly faulted into the address space.] Implementing the
read VFS-operation to read compressed files sequentially is not a
problem; the same is true for sequentially accessed mmappings. Files
accessed randomly should either be [de]compressed in blocks [maybe
with a per-file [group/type?] common compression context to at least
somewhat improve the compression ratio?] or decompressed as a whole
from the beginning [and temporarily stored on disk [or in the
buffer-cache] for further random access].
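
To illustrate the block-wise scheme, here is a small user-space
sketch [the block-table layout is invented, and zlib's
compress()/uncompress() merely stand in for whatever compressor a
real file-system would use]; the point is that a random read only
decompresses the 4Kb blocks it actually touches:

  #include <stdio.h>
  #include <string.h>
  #include <zlib.h>

  #define LBLOCK 4096           /* logical (uncompressed) block size */

  struct cblock {               /* one entry of the per-file table  */
      const Bytef *data;        /* compressed bytes of one block    */
      uLong        clen;        /* their length                     */
  };

  /* Copy `len' bytes starting at logical offset `off' into `dst',
   * decompressing only the blocks that are actually touched. */
  static int
  read_block_compressed(const struct cblock *tab, unsigned long nblocks,
                        unsigned long off, unsigned char *dst,
                        unsigned long len)
  {
      Bytef buf[LBLOCK];

      while (len > 0) {
          unsigned long blk  = off / LBLOCK;
          unsigned long skip = off % LBLOCK;
          uLongf        out  = LBLOCK;
          unsigned long n;

          if (blk >= nblocks)
              return -1;
          if (uncompress(buf, &out, tab[blk].data, tab[blk].clen) != Z_OK)
              return -1;
          if (skip >= out)
              return -1;
          n = out - skip < len ? out - skip : len;
          memcpy(dst, buf + skip, n);
          dst += n; off += n; len -= n;
      }
      return 0;
  }

  int main(void)
  {
      static Bytef  src[LBLOCK], cbuf[LBLOCK + 64];
      unsigned char out[6];
      uLongf        clen = sizeof(cbuf);
      struct cblock tab[1];

      memset(src, 'x', sizeof(src));
      memcpy(src + 100, "hello", 5);
      if (compress(cbuf, &clen, src, LBLOCK) != Z_OK)
          return 1;                     /* build one compressed block */
      tab[0].data = cbuf;
      tab[0].clen = clen;

      if (read_block_compressed(tab, 1, 100, out, 5) == 0) {
          out[5] = '\0';
          printf("%s\n", (char *)out);  /* prints "hello" */
      }
      return 0;
  }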

> Now, solve this one, you open the file read/write and you start changing
> the file in random places. Since even changing one byte will change the
> compression dynamics you end up re-compressing/re-writing everything
> following the position at the write(), block compression or no... I
> challenge you to code that efficiently.

My experience with computers indicates that most activity comes in
bursts; we therefore should not try to compress on the fly when the
CPU has better things to do [unless it reduces IO and that's the
current bottleneck]. We should rather store the uncompressed data and
compress it later, when we're in need of disk-space or have nothing
better to do. On a side-note, we shouldn't forget the uncompressed
version of a file the moment we've compressed it; we should merely
note that we _can_ reuse its disk space, if needed.
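
A minimal user-space sketch of such a policy [the 10% free-space
threshold, the idle-load cut-off and the one-minute poll are numbers
I made up; statvfs() and getloadavg() are just the portable calls I
know of, and the actual selection and compression of files is left
out]:

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/statvfs.h>

  /* Fraction of the blocks on `mountpoint' still available to users. */
  static double
  free_fraction(const char *mountpoint)
  {
      struct statvfs sv;

      if (statvfs(mountpoint, &sv) != 0)
          return 1.0;           /* on error, pretend there is plenty */
      return (double)sv.f_bavail / (double)sv.f_blocks;
  }

  /* The policy: only compress when space is tight or the machine idles. */
  static int
  should_compress_now(const char *mountpoint)
  {
      double load;

      if (free_fraction(mountpoint) < 0.10)
          return 1;             /* short on disk-space             */
      if (getloadavg(&load, 1) == 1 && load < 0.1)
          return 1;             /* nothing better to do            */
      return 0;                 /* the CPU has better things to do */
  }

  int main(void)
  {
      for (;;) {
          if (should_compress_now("/"))
              puts("would pick a cold file, compress it, and only "
                   "mark the uncompressed blocks as reclaimable");
          sleep(60);
      }
  }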

The next step towards the `filesystem of the future'[tm] probably
should be to evaluate current VFS-interfaces. Can anybody provide me
with some pointers to overview articles or WWW-sites? What is
good/bad about the VFS-layers in use today? Or is VFS-design thought
to be too trivial to get wrong?

Christian.