Re: journaling filesystem

Victor Yodaiken (yodaiken@chelm.cs.nmt.edu)
Mon, 16 Jun 1997 09:36:42 -0600


On Jun 16, 4:02am, stephen farrell wrote:
Subject: Re: journaling filesystem
>But you cannot guarantee that you'll make it through committing these
>changes, so I don't understand what this gains you... at best it
>sounds like you'd just be minimizing the window in which a crash will
>hurt you.

I haven't thought about this issue for a long time, and perhaps
I've become even more confused, but I don't think you are correct.

You can have a safe commit. The algorithm, standard for DB logging,
is to keep two instances of the FS: one is the primary, consistent
version, and one is the current version. Suppose we have two
SuperBlocks, one primary and one current, each pointing to its own
free list and inode table. All data is written to free blocks, and
all freed blocks are placed on a "zombie list" until commit. The
inode table is used to allow new inodes to be allocated on changes,
so that inode numbers are not associated with fixed disk locations.
The commit can be done in several ways, but to start, assume that we
simply flush all dirty buffers and then write the current SB and
current inode table. Then we mark the current SB as the primary SB;
this write is atomic and flips us from one consistent FS to another.
On recovery or reboot, the primary is used, the zombie list is
copied to the free list, and the primary is written over the current.
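
Schematically, in C -- a toy, compilable model of the scheme above.
The struct layout and helper names are my own invention for
illustration, not the actual Auros code:

    #include <stdio.h>

    struct superblock {
        int free_list;      /* head of this instance's free list    */
        int inode_table;    /* root of this instance's inode table  */
        int zombie_list;    /* blocks freed since the last commit   */
    };

    static struct superblock slot[2];  /* the two on-disk SuperBlocks */
    static int primary = 0;            /* which slot is the consistent FS */

    /* Stub: flush order doesn't matter, because new data only ever
     * lands on blocks the primary FS does not reference. */
    static void flush_dirty_buffers(void) { }

    /* Stub: splice the zombie blocks back onto the free list. */
    static int merge_lists(int free, int zombie) { (void)zombie; return free; }

    static void commit(void)
    {
        int current = !primary;
        flush_dirty_buffers();  /* data, free list, inode table of current */
        primary = current;      /* the one atomic write: flip primary      */
    }

    static void recover(void)
    {
        struct superblock *sb = &slot[primary];  /* last committed state */
        sb->free_list = merge_lists(sb->free_list, sb->zombie_list);
        slot[!primary] = *sb;                    /* current := primary   */
        printf("recovered from SuperBlock %d\n", primary);
    }

    int main(void)
    {
        commit();
        recover();  /* as if we crashed and rebooted right after commit */
        return 0;
    }

The key property is that nothing reachable from the primary SB is
ever overwritten before the flip, so a crash at any instant leaves
at least one consistent FS on disk.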

The FS I'm remembering was designed as a relatively straightforward
Unix FS, but there are numerous obvious optimizations -- e.g.,
applying the scheme per cylinder group, or setting aside pairs of
cylinder groups and adding a 3rd level for an in-memory current
while current is being flushed ...
One thing this algorithm does not do is work well when the disk is
nearly full, but fault-tolerance must cost something, and disks are
cheap. The algorithm works with a simple buffer cache flushing
policy, while full journaling seems to require that buffers be
flushed in a particular order.
I think that is a compelling advantage because the interaction of
VM paging and FS buffer cache use is too complicated as it is.

I'm a little skeptical of Sprite LFS, Zebra, and similar
journaling FS schemes for several reasons:
1. The recovery time seems like it could be quite long.
2. If the buffer cache has any misses at all, performance collapses.
Not all FS use looks like the CS compile/run/edit/read-mail cycle,
and performance on long sequential reads would be interesting.
3. The paradigm of dumping everything in one massive write is suspect
in an HA environment: a 128M buffer flushed at 60ns
per word is about 2 seconds. That's a very long time.
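
To check the arithmetic (assuming 4-byte words, which the figure
above seems to imply):

    #include <stdio.h>

    int main(void)
    {
        double words = 128.0 * 1024 * 1024 / 4;  /* 128 MB of 4-byte words */
        double secs  = words * 60e-9;            /* 60 ns per word         */
        printf("%.2f s\n", secs);                /* prints 2.01 s          */
        return 0;
    }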

Here's a reference that describes the FS briefly and also talks
about the other virtues of the Auros OS, which died along with my
stock options so many years ago.

@Article{BorgBlauGraetschHerrmannOberle89,
  key     = "Borg et al.",
  author  = "Anita Borg and Wolfgang Blau and Wolfgang Graetsch and
             Ferdinand Herrmann and Wolfgang Oberle",
  title   = "Fault Tolerance Under {UNIX}",
  journal = "ACM Transactions on Computer Systems",
  volume  = "7",
  number  = "1",
  pages   = "1--24",
  month   = feb,
  year    = "1989",
}