Re: scary ext2 filesystem question

Alexander Viro (viro@math.psu.edu)
Sun, 27 Dec 1998 10:04:25 -0500 (EST)


On 25 Dec 1998, Mirian Crzig Lennox wrote:

> Kurt Garloff <K.Garloff@ping.de> writes:
> > On Fri, Dec 25, 1998 at 11:42:57AM -0500, Mirian Crzig Lennox wrote:
> > > _Practical File System Design_, Dominic Giampaolo, p. 36
> >
> > Nonsense!
> > The ext2fs uses write caching, like any powerful filesystem does.
> > This cannot confuse any program. Any program that reads data from the disk
> > goes through the page cache, so it gets the most recent data, whether or
> > not it has been written to disk yet. It is guaranteed to be written to disk
> > sometime; that's what bdflush/update and kswapd are for. On unmounting the
> > fs all buffers are flushed, even if you managed to kill your bdflush before.
>
> That's exactly what I was thinking.
>
> Obviously, if [any] computer system crashes, bad things can happen.
> That was not my point of confusion; rather, I was bewildered because
> the author seemed to be implying that these kinds of problems could
> occur *even during normal filesystem operation*. I couldn't figure
> out how that could be, unless it was due to some kind of bug in the
> caching code, not a flaw in the design of the filesystem.
>
> > I noted the name of the author in my ignore list ... obviously he did not
> > understand anything!
>
> I'm inclined to agree, especially since elsewhere he refers to ext2 as
> "the fast and unsafe grandchild" of FFS.

Sigh... It's not about ext2 per se. It's about our implementation.
Please, before flaming on that topic, look for an article about
softupdates and READ it. The point being that much better guarantees can
be provided *without* the cost of a totally synchronous implementation.
IIRC the overhead was below 5% *in the initial implementation*. Overhead
compared to the async variant, that is. It can be done for the Linux
implementation, but we'll have to clean up the VFS stuff first.
In short, the idea behind s-u is that logically we have two copies
of the filesystem: the on-disk one and the in-core+disk one. Any
implementation makes changes to the latter and then somehow synchronizes
the former. One can think of the dirty part of the buffer cache as a patch
about to be applied to the on-disk fs. Now, the differences begin:
different chunks of said patch (i.e. different changes to the inode
table/directories/datablocks/bitmaps) are not independent. If we want to
guarantee that the on-disk copy is consistent at any moment, we must
submit these changes in _some_ order. The simplest way to do it is a full
sync, i.e. block any further changes until the current one has been
applied to disk. Bad, since it hampers performance in a big way. A more
clever strategy looks like this: for each change applied to the in-core fs
we keep the following triple: (data before, data after, dependencies).
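
To make that a bit more concrete, a per-change record could look roughly
like the C sketch below; the names (dep_record and its fields) are made up
for illustration and are not the actual BSD soft updates structures:

/*
 * Illustrative sketch only (invented names, not the real BSD soft
 * updates structures): one record per change applied to an in-core
 * buffer.
 */
#include <stddef.h>

struct dep_record {
	size_t             offset;      /* where in the buffer the change lives */
	size_t             len;         /* length of the changed region */
	void              *before;      /* the bytes as they are on disk now */
	void              *after;       /* the bytes as they are in core */
	struct dep_record *depends_on;  /* change that must reach disk first */
	struct dep_record *next;        /* next record on the same buffer */
};

Each dirty buffer would then carry a list of such records, and a change
may only be exposed on disk once everything it depends on has been
written.
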
Now, when it comes to flushing dirty pages to disk we do the following:
first of all, flush the buffers that don't have any pending dependencies.
Then, if we want to flush more, pick a dirty buffer, look for pending
dependencies on it, *make a copy with those changes unrolled*, and submit
that copy to disk. For each write request we have a callback function that
is called when the driver finishes; make it release the records for the
changes that were committed to disk by that request.
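
A very rough C sketch of that flushing logic, with invented helper names
(shadow_copy(), dep_satisfied(), submit_write(), release_committed_deps())
standing in for whatever a real implementation would use:

/*
 * Sketch only: the helpers below are hypothetical, not a real kernel
 * interface.  struct dep_record is the one sketched above.
 */
#include <stddef.h>
#include <string.h>

struct su_buffer {
	void              *data;  /* in-core contents of the block */
	size_t             size;
	struct dep_record *deps;  /* pending changes recorded on this block */
};

void *shadow_copy(const void *data, size_t size);
int   dep_satisfied(const struct dep_record *d);  /* prerequisite on disk? */
void  submit_write(void *data, size_t size,
		   void (*done)(struct su_buffer *), struct su_buffer *cookie);
void  release_committed_deps(struct su_buffer *buf);

/* Completion callback: the driver is done, so drop the records for the
 * changes that actually went out in this request. */
static void write_done(struct su_buffer *buf)
{
	release_committed_deps(buf);
}

void flush_buffer(struct su_buffer *buf)
{
	struct dep_record *d;
	void *copy;

	if (buf->deps == NULL) {
		/* No pending dependencies: the block can go out as-is. */
		submit_write(buf->data, buf->size, write_done, buf);
		return;
	}

	/* Some changes must not hit the disk yet: write a copy with
	 * those changes unrolled back to their on-disk contents. */
	copy = shadow_copy(buf->data, buf->size);
	for (d = buf->deps; d != NULL; d = d->next)
		if (!dep_satisfied(d))
			memcpy((char *)copy + d->offset, d->before, d->len);

	submit_write(copy, buf->size, write_done, buf);
}

The callback is what ties the bookkeeping to the actual I/O: records only
go away once the driver confirms the write.
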
The overhead of the strategy above, compared to the completely
async one, consists of two parts: housekeeping (we are getting new data
structures, after all) plus the cost of dealing with buffers that have to
be written several times. Experiments show that the latter case is pretty
rare, especially if we flush the buffers with no dependencies first. And
it gives much better guarantees in case of a crash.
