Re: Fault tolerance. . .

From: Valdis . Kletnieks
Date: Mon Jul 25 2005 - 09:58:22 EST


On Sun, 24 Jul 2005 21:59:59 EDT, John Richard Moser said:

> I'm thinking of application level fault tolerance using roll-back states
> or something weird, to restore the system as affected by that
> application to a point before the error. The obvious visual effect
> would be that if an application were to crash, it and potentially
> interrelated applications would suddenly reset to a state a few seconds
> to a few minutes earlier.

Google for "checkpoint-restart" - it's a big field in scientific
computing, where you don't want to lose the results of a 3 week run on a
supercomputer just because the system crashes 5 minutes before it's done.

(Just think - if they'd had a proper checkpointing scheme, most of the
Hitchhiker's trilogy wouldn't have happened... :)
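
For what it's worth, here is a minimal sketch of what "designed for
checkpointing" tends to look like at the application level. It is an
illustration only, not taken from any particular checkpoint-restart
package: the names save_checkpoint, load_checkpoint and run, the JSON
state file, and the crash_at parameter (which just simulates the machine
dying) are all made up for the example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: a crash mid-write leaves the old checkpoint intact."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename over the old checkpoint only once the new one is fully on disk.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_i": 0, "total": 0}


def run(path, n_steps, every=100, crash_at=None):
    """Sum 0..n_steps-1, checkpointing every `every` steps.

    crash_at (hypothetical, for demonstration) simulates a crash at that step.
    """
    state = load_checkpoint(path)
    for i in range(state["next_i"], n_steps):
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += i
        state["next_i"] = i + 1
        if state["next_i"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

Running it once with crash_at set and then again without it resumes from
the last saved step instead of from zero. The point is that the program
itself decides what its state is and when it is consistent enough to save.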

> Maintaining the state is also easy:
>
> - When a file is changed, track the changes and attach them to the last
> state save
> - When memory pages are written to, cache the old copies first
> (unfortunately each page has to be made CoW after every state save)
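
As a toy model of the page-tracking idea quoted above (and only that:
real copy-on-write is done by the kernel with page protection and fault
handling, which this does not attempt), something like the following
captures the bookkeeping. The class name CheckpointedMemory and its
methods are hypothetical, and "pages" here are just bytearrays.

```python
class CheckpointedMemory:
    """Toy model: pages are bytearrays in a list, not real kernel CoW pages."""

    def __init__(self, n_pages, page_size=4096):
        self.pages = [bytearray(page_size) for _ in range(n_pages)]
        # page index -> copy of that page as it was at the last state save
        self.saved = {}

    def state_save(self):
        # New epoch: drop the old copies, so every page is "CoW" again.
        self.saved.clear()

    def write(self, page, offset, data):
        if offset < 0 or offset + len(data) > len(self.pages[page]):
            raise ValueError("write outside page")
        # First write to this page since the last save: cache the old copy first.
        if page not in self.saved:
            self.saved[page] = bytes(self.pages[page])
        self.pages[page][offset:offset + len(data)] = data

    def rollback(self):
        # Restore only the pages touched since the last state save.
        for page, old in self.saved.items():
            self.pages[page][:] = old
        self.saved.clear()
```

Note that this only covers memory the tracker can see; files, pipes, and
other processes are outside it.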

This is actually a lot harder than it looks - most of the real-life applications
of checkpoint-restart have been to programs that were designed to play nice
with checkpointing. It's *really* hard to do it with a program that wasn't
designed to be checkpointed, as you noticed yourself:

> This of course raises many questions and concerns that make this
> ridiculous and probably not entirely possible:
>
> - What about huge modifications to files in a short time? Make a new
> file, then write 10,000,000,000 bytes past the end and watch it crash.
> - What about lost work in interrelated applications?
> - Will the system state remain consistent?
> - Will it crash over and over and over?
> - Connecting to named pipes? (easily handled, not discussed here)
> - Crashes are usually trappable, and then programs exit cleanly. They
> won't care about this
> - How does a process know to change course if it gets restored?

Exactly the sort of things that make it hard...
