The argument for fs assistance in handling archives (was: silent semantic changes with reiser4)

From: Jamie Lokier
Date: Wed Sep 01 2004 - 15:12:57 EST


I'm going to explain why filesystem support for .tar.gz and other
"document container"-like formats is useful. This does _not_ mean tar
in the kernel (I know someone who can't read will think that if I
don't say so explicitly); it does mean hooks in the kernel for
maintaining coherency between different views, and filesystem support
for caching.

The vision I'm going for here is:

1. OpenOffice and similar programs store compound documents in
standard file formats, such as .tar.gz, compressed XML and such.

Fs support can reduce CPU time handling these sorts of files, as
I explain below, while still working with the standard file formats.

With appropriate userspace support, programs can be written that have
access to these capabilities on all platforms; the reduced CPU time
simply appears only on platforms with the fs support.

2. Real-time indexing and local search engine tools. This isn't
just things like local Google; it's also your MP3 player scanning
for titles & artists, your email program scanning for subject
lines to display the summary fast, your blog server caching built
pages, your programming environment scanning for tags, and your
file transfer program scanning for shared deltas to reduce bandwidth.

I won't explain how these work as it would make this mail too
long. It should be clear to anyone who thinks about it why the
coherency mechanism is essential for real-time operation, and why a
consistent interface to container internals helps with performance.

Horst von Brand wrote:
> Jamie Lokier <jamie@xxxxxxxxxxxxx> said:
> > When a simple "cd" into .tar.gz or .iso is implemented properly, it
> > will have _no_ performance penalty after you have first looked in the
> > file, so long as it remains in the on-disk cache. And, the filesystem
> > will manage that cache intelligently.
>
> Nonsense. The .iso or .tar or whatever would have to be kept un-isoed or
> un-tarred in memory (or on disk cache) for this to be true, and that takes
> quite a long time. Each time you want to peek anew at linux/Makefile, the
> whole tarfile will have to be read and stored somewhere,

Wrong. "So long as it remains in the on-disk cache" means each time
you peek at linux/Makefile, the tarfile is _not_ read.

For a tarfile it's slow the first time, and again whenever it falls
out of the on-disk cache; otherwise, for component files you are using
regularly (even over a long time), it's as fast as reading a plain file.

You obviously know this, as you mentioned the on-disk cache in your
reply, so I infer from the rest of your mail that what you're trying
to say is more about modifications than about reading archives: that
it would be silly to keep working data in .tar.gz files, because
working inside them regularly would be slow.

Which means you must be assuming, incorrectly, that these .tar.gz
files are really kept up to date on disk with every component file
modification.

Which is silly. .tar.gz files are suitable for *transport* and
*archival*, not regular random access; it's almost rude of you to
suggest I didn't know that.

The proposal is that .tar.gz files (and others) are analysed on demand
and their content cached on disk as it is read. Subsequent reads will
then be as fast as if you had unpacked the archive by hand with the
tar command. This is exactly what you do now, with a small bit of
added convenience.
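
As an illustration of that policy, here is a minimal userspace sketch
in Python: the first read of a component unpacks it into an on-disk
cache keyed on the archive's path and mtime, and later reads are served
straight from that cache. The cache location and the mtime-based
coherency check are assumptions of the sketch, standing in for the
kernel coherency hooks; nothing here is an existing interface.

    import os
    import tarfile

    CACHE_ROOT = "/var/cache/archive-view"    # hypothetical cache location

    def cache_dir(archive):
        # Key the cache on path + mtime so a rewritten archive invalidates
        # it; a real implementation would use the coherency hook instead.
        st = os.stat(archive)
        tag = "%s-%d" % (archive.replace("/", "_"), int(st.st_mtime))
        return os.path.join(CACHE_ROOT, tag)

    def read_component(archive, member):
        # Return the contents of one member of a .tar / .tar.gz.
        cached = os.path.join(cache_dir(archive), member)
        if not os.path.exists(cached):
            # First access (or the archive changed): unpack this member.
            with tarfile.open(archive, "r:*") as tf:
                tf.extract(member, path=cache_dir(archive))
        with open(cached, "rb") as f:     # later reads never touch the tarball
            return f.read()

Reading linux/Makefile out of a kernel tarball this way costs a full
scan once; after that it is an ordinary file open, which is all that
"so long as it remains in the on-disk cache" means.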

The other part of the proposal is that when you modify a component,
the modifications are stored on disk in the same way as ordinary
files, using the regular high performance random access disk
structures. Nothing is done to recreate the archive at this point; I
think this is where you misunderstood and thus flamed.

_If_ after modifying components, you then read the .tar.gz as a file,
then (and only then) is it recreated, taking in the worst case the
same time as running the tar command.

The _only_ times when that occurs are precisely those times when you
would have run the tar command manually: because you only read the
.tar.gz file when you need the flat file for some purpose, such as
attaching it to an email, transferring by FTP or HTTP, or reading it
into a program that needs it in that format.
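
The write side can be sketched the same way (again just userspace
Python, with made-up names): component writes go to ordinary files, and
the flat archive is rebuilt only when somebody actually asks for it,
and only if something changed since the last rebuild.

    import os
    import tarfile

    def write_component(tree, member, data):
        # Modifications are ordinary file writes; nothing is repacked here.
        path = os.path.join(tree, member)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)

    def flat_archive(tree, archive):
        # Called only when the flat .tar.gz is read (attachment, FTP, ...).
        newest = max((os.path.getmtime(os.path.join(d, name))
                      for d, _, names in os.walk(tree) for name in names),
                     default=0)
        if not os.path.exists(archive) or os.path.getmtime(archive) < newest:
            with tarfile.open(archive, "w:gz") as tf:  # the one expensive step
                tf.add(tree, arcname=".")
        return archive

A hundred component writes followed by one read of the archive cost one
repack, exactly as if you had run tar by hand at that point.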

If there is anything about that strategy that doesn't make sense, then
I suggest I have failed to explain it properly, and you're welcome to
demand a clearer explanation.

> the .tar format is optimized for compact storage, the on-disk format
> of a filesystem is optimized for fast access and modifiability.

Actually no, .tar is not compact at all. It's also not optimised for
random read access, but once an index is built it is very fast for that.

.tar.gz is compact. Although that is not especially fast for random
read access, you can build a "compression dictionary index" which
optimises random read access even in a .tar.gz, without ever unpacking
the whole thing.
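
That index can be sketched in userspace too. The idea (zlib's
examples/zran.c does the same thing in C) is one sequential pass that
saves periodic snapshots of the decompressor state; a later random read
restarts from the nearest snapshot instead of from byte 0. The snapshot
spacing and names below are arbitrary choices of the sketch:

    import bisect
    import zlib

    CHECKPOINT_EVERY = 4 * 1024 * 1024  # snapshot every ~4 MiB of output
    CHUNK = 64 * 1024

    def build_index(path):
        # One sequential pass over the .gz, keeping (uncompressed offset,
        # compressed offset, decompressor snapshot) instead of the output.
        index = []
        d = zlib.decompressobj(zlib.MAX_WBITS | 16)  # 16 => gzip header
        out_off = in_off = next_mark = 0
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK)
                if not chunk:
                    break
                out_off += len(d.decompress(chunk))
                in_off += len(chunk)
                if out_off >= next_mark:
                    index.append((out_off, in_off, d.copy()))
                    next_mark = out_off + CHECKPOINT_EVERY
        return index

    def read_at(path, index, offset, length):
        # Random read of uncompressed bytes, resumed from the nearest
        # snapshot rather than from the start of the file.
        pos = bisect.bisect_right([e[0] for e in index], offset) - 1
        if pos < 0:
            out_off, in_off, d = 0, 0, zlib.decompressobj(zlib.MAX_WBITS | 16)
        else:
            out_off, in_off, saved = index[pos]
            d = saved.copy()              # keep the stored snapshot reusable
        data = bytearray()
        with open(path, "rb") as f:
            f.seek(in_off)
            while len(data) < (offset - out_off) + length:
                chunk = f.read(CHUNK)
                if not chunk:
                    break
                data += d.decompress(chunk)
        start = offset - out_off
        return bytes(data[start:start + length])

A real implementation, like zran.c, stores the 32 KB windows themselves
so the index can be kept on disk; the in-memory snapshots here are just
the shortest way to show the idea.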

Some formats like .iso, .zip and .jar are optimised for compact
storage _and_ fast random access. They come with an index, and don't
need one to be built and cached.

These are not filesystem-like formats, obviously, but they are the
formats you need to pack and unpack when exchanging data with other
people. That's the _only_ reason they're on your disk (virtually;
they may not really exist some of the time).
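
The .zip case is easy to see from userspace, because the central
directory at the end of the file is exactly such an index; Python's
zipfile consults it and seeks straight to the one member you ask for.
(The file names below are made up.)

    import zipfile

    def read_one(archive, member):
        # zipfile reads the central directory, then seeks directly to this
        # member's data; the rest of the archive is never read or unpacked.
        with zipfile.ZipFile(archive) as zf:
            return zf.read(member)

    # e.g. read_one("bundle.zip", "docs/Makefile")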

> Now go ahead and enlarge a file on your .iso/.tar a bit...it
> will take ages to rebuild the whole thing. There is a _reason_ why there
> are filesystems and archives, and they use different formats. If it weren't
> so, everybody and Aunt Tillie would just carry .ext3's around, and would
> wonder what the heck all this fuss is about.

If you enlarge a file in your .iso/.tar subdirectory a bit... nothing
happens. Why would a smart programmer do anything so silly as rebuild
the archive at that point?

_If_ you subsequently read the .iso/.tar _file_, then and only then
does it rebuild. Once, after lots of component writes. The only time
you would ever do that is if you are specifically reading the archive
file, which means you actually want to use the repacked file at that
point, for example to FTP it somewhere or use as an email attachment.

If the filesystem did not do that on demand, you would have run the
tar command manually at that point, precisely because that's a point
where you need the repacked archive. So in the case where the
filesystem repacks the archive, it takes exactly the same time as you
would have taken anyway; it's just automatic instead of manual. (As a
handy side effect, the automatic method offers lower latency for
transmissions.)

Now, why would we bother with all this?

I see three reasons: convenience, time efficiency, and storage efficiency.

Convenience is simply that it is handy to be able to look inside
archive files, in those situations where we _currently_ use them,
without having to untar them manually when we need something, and
without having to remember to clean up the old unpacked directories
when we discover we aren't using them often any more. This is _not_
an argument for using .tar.gz files in place of ordinary directories!
Convenience applies to doing the things you do now.

Time efficiency has two angles. A simple one is that accessing
.tar.gz contents through any kind of filesystem interface, even a pure
userspace one, can be faster than unpacking whole files, simply because
there are ways to decode parts without unpacking it all.

However, the main time efficiency that I see comes from the increasing
number of applications where the "Open" and "Save" operations store
data in *.gz files (e.g. OpenOffice compressed XML documents), or
*.tar.gz files (some compound document formats), or other things like
that. (If you think about it, quite a lot of things are like that).

With these, every "Open" currently has to decompress and maybe unpack
an archive format. Every "Save" currently has to pack and then
compress. This is done so the user sees a single flat file containing
a complex document, but it is a waste of CPU time until the user
actually transports the flat file.

The lazy proposal, as described earlier in this mail, _removes_ these
CPU-intensive decompression, unpacking, packing and compression steps
when they are unnecessary. The experience of a single file containing
a complex document is maintained, but the CPU time is reduced in many
typical operations. "Open" gets faster after you first look at a
file, "Save" gets a lot faster for large documents, and the equivalent
of grep (or later, a real-time local search engine) gets a lot faster
too. There is no operation whose overall CPU time increases.

This is what I've meant throughout this thread when I say containers:
document files of the kind used to hold text, figures, etc. that are
typically transported as a unit and edited as a unit, but which at the
moment are stored in somewhat CPU-intensive formats for compactness.
That's fine for a 1-page letter, but think of a 500-page OpenOffice
book containing a large number of diagrams.

However, even simple programs that read & write compressed XML benefit.

The proposal allows that sort of thing to be handled more time
efficiently than it is today, and in a way that is very practical to use.

(It's unthinkable that OpenOffice and similar programs would carry a
lot of code that stores data in a special way just for Linux, purely
for performance benefits that are otherwise user-invisible. It is
thinkable, though, that a general-purpose userspace library, portable
to all platforms, could be written which takes advantage of the
facility when it's available and does the equivalent of today's
"compress on save" when the filesystem facility isn't available.)

Finally, storage efficiency comes from simply allowing the filesystem
and supporting tools to decide when it is best to store data in
unpacked form, packed & compressed form, both at the same time, or
some other archival form. The filesystem has comparatively good
knowledge of which data to archive and when, but it can only maintain
the illusion if there's a mechanism to keep the archived forms and
unpacked forms coherent.

Now, I'm sure there is a way to implement this on top of a neat and
simple kernel feature involving weird bind mounts, leases, dnotifies
and FUSE. But those kernel offerings are quite a mess at the moment
and don't fit together in a way which can usefully create this effect.

Auto-mounting uservfs directories over file-as-directory, using
moveable bind mounts, _nearly_ offers the kernel primitives we need to
build this in userspace and get all the efficiencies. But not quite.

(We could obviously do it all in userspace by putting _everything_ in
a userspace filesystem, but that would be silly as it would throw away
all of the performance of having a threaded filesystem in the kernel.
It might do as a proof of concept though).

-- Jamie