Re: Hot pluggable CPUs ( was Linux 2.5 / 2.6 TODO (preliminary) )

From: Andrew Morton (andrewm@uow.edu.au)
Date: Mon Jun 05 2000 - 01:53:15 EST


James Sutherland wrote:
>
> IIRC, one of NASA's recent satellite modules does this with
> memory (radiation in space causes fairly
> frequent bit errors, so they have N+N+N voting via an ASIC);

This was how Stratus' h/w fault tolerant systems operated. Three 68k's
running in lockstep with hardware voting which ignored one CPU if the
other two disagreed with it. It made Stratus a US$400M company by the
early nineties.
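
To make the voting idea concrete, here is a minimal Python sketch of
2-of-3 majority voting. It is an illustration only: the real thing is
done in hardware on every bus cycle, and the function name `vote` and
the use of plain values as "CPU outputs" are assumptions made for the
example, not a description of Stratus' actual logic.

```python
from collections import Counter

def vote(outputs):
    """Majority-vote over the outputs of three lockstepped units.

    Returns (agreed_value, dissenters), where dissenters is the list of
    unit indices whose output differed from the majority.
    Hypothetical helper for illustration only.
    """
    if len(outputs) != 3:
        raise ValueError("expected exactly three outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        # All three disagree: there is no majority to trust.
        raise RuntimeError("no majority among units")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters
```

A caller would use the agreed value and take any dissenting unit out of
service, e.g. `value, bad = vote([7, 7, 9])` marks unit 2 as suspect.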

Tandem had/have a rather different design. It's still single-box
hardware redundancy, but the CPUs do not run in exact lock-step. The
CPUs do a periodic rendezvous to compare states; if one is wrong it gets
ignored/rebooted. The idea here is that if one CPU fails in software
due to a rare race, the others won't. This might mean that Tandem-aware
applications have a benefit over naive applications. I forget...

After Stratus' chief engineer Robert Reid left the US he came back here
and established a startup (with me in it) to do a similar thing with
SunOS/SPARC. It was ultimately not successful for various reasons, one
of which, I believe, was "it's not done that way".

Many system failures are due to software. Voted lockstep designs don't
help there.

Many system failures are due to operator errors. Ditto.

Many failures are due to external events (power, lightning, sabotage,
backhoe operators).

In a communications application you can't get more than, umm, 5.5 nines
out of a single box because the building in which that box lives only
has 5.5 nines availability!
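
The arithmetic behind that can be sketched in a few lines of Python.
This assumes the simplest possible model, in which the box is only
reachable when both the box and the building are up and the two fail
independently, so availabilities multiply. The helper names are
made up for the example, and 5.5 nines is just the figure quoted above
(it works out to roughly 100 seconds of downtime a year).

```python
import math

def nines_to_availability(nines):
    """Convert 'N nines' to a fraction, e.g. 3 nines is 0.999."""
    return 1.0 - 10.0 ** (-nines)

def availability_to_nines(availability):
    """Inverse of nines_to_availability."""
    return -math.log10(1.0 - availability)

def series(*availabilities):
    """Availability of components that must ALL be up (independent failures)."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total

# Assumed figures: a very good fault-tolerant box inside a 5.5-nines building.
box = nines_to_availability(7.0)
building = nines_to_availability(5.5)
combined = series(box, building)

# However good the box is, the combination cannot beat the building.
print(f"box + building: {availability_to_nines(combined):.2f} nines")
```

However many nines the box itself has, the product stays below the
building's own figure, which is the point being made.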

The best HA designs rely on the client devices to do the failover.
Check out /etc/resolv.conf sometime :). It lists several nameservers
and the resolver moves on to the next one when one doesn't answer. Many
IP applications don't do that.
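
For what it's worth, client-side failover of this kind need not be much
code. The following is a rough Python sketch, not a recipe: the function
name `connect_with_failover`, the server list and the timeout are all
assumptions for illustration, and a real client would also want retry
and backoff policy.

```python
import socket

def connect_with_failover(servers, timeout=2.0, connect=socket.create_connection):
    """Try each (host, port) in order, much as a resolver walks its
    nameserver list, and return (server, connection) for the first one
    that answers. `connect` is injectable so the logic can be exercised
    without a network. Hypothetical helper for illustration only.
    """
    errors = []
    for server in servers:
        try:
            return server, connect(server, timeout)
        except OSError as exc:
            # This server is unreachable; remember why and try the next.
            errors.append((server, exc))
    raise ConnectionError(f"all servers failed: {errors}")

# Example use (hostnames are placeholders, ideally in different locations):
#   server, conn = connect_with_failover(
#       [("svc1.example.net", 5000), ("svc2.example.net", 5000)])
```

The failover decision lives in the client, so it still works when an
entire site, not just one box, drops off the network.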

If you can, forget about h/w fault tolerance and concentrate on
geographically distributed s/w solutions.
