Re: Inter-Kernel Communications (Multi Kernel Clusters)

Keith Rohrer (kwrohrer@uiuc.edu)
Mon, 24 Feb 1997 17:08:09 -0600 (CST)


> > There are some funny special cases that need to be considered. How
> > does your proposal handle failing components in the cluster? Imagine
> > a network failure that splits your cluster into two parts, each fully
> > functional but unconnected to the rest: the so-called split-brain
> > syndrome. Now each cluster half will continue processing and assume
> > it is authoritative? Imagine a database system being split up into
> > two systems ...
> the problem is that there is no 'absolute' authority function available.
> So we should simply ignore this problem.
How to deal with the results of network partitioning is a research topic,
and the solutions are likely service-specific (e.g. a replicated
filesystem would have to handle it differently than, say, a namespace
server). However, the usual assumptions may be too strong, especially for
systems where operators are on duty at all times: namely, that machines
which discover a failure must mask it completely, and that they will
never learn how far down a suspected-down machine really was.

> so the solution is to define 'trusted' authoritative media (an additional
> network of serial lines or something similar), and swear loudly when a
> split happens despite these measures ;)
I'd actually suggest multiple routing paths among servers, preferably
with multiple network cards on each critical and backup machine. If
you pick appropriate topologies (and detect network failure reasonably
promptly) it should be much harder to take a machine off the net entirely,
let alone partition the servers. Limiting the set of machines which can
provide critical services can also minimize failures: if, every time a
PPP link goes down, all the machines in someone's house stage a revolution,
you may well have problems.
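
To make the two suggestions above concrete, here is a minimal sketch in
Python. Everything in it is an assumption made for illustration (the Peer
class, the path labels, the may_take_over function); none of it comes
from the proposal under discussion. A peer is suspected down only when
every path to it has failed, and only machines in a designated set may
take over a service.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class Peer:
    """A remote server reached over several independent paths (hypothetical model)."""
    name: str
    # Path label (e.g. one per network card) -> did the last probe succeed?
    paths: Dict[str, bool] = field(default_factory=dict)

    def reachable(self) -> bool:
        # Suspect the peer only when every path to it has failed.
        return any(self.paths.values())


def may_take_over(local: str, eligible: Iterable[str], primary: Peer) -> bool:
    """Return True if `local` is allowed to start providing the service."""
    if local not in set(eligible):
        # Machines outside the designated set never take over.
        return False
    return not primary.reachable()


primary = Peer("primary", {"eth0": False, "eth1": True})
print(may_take_over("backup", ["primary", "backup"], primary))  # False: one path still works
primary.paths["eth1"] = False
print(may_take_over("backup", ["primary", "backup"], primary))  # True: all paths have failed
print(may_take_over("laptop", ["primary", "backup"], primary))  # False: not in the eligible set
```

Note that this only makes a false "primary is down" verdict less likely.
If every path really is cut, the backup still cannot tell a partition
from a dead primary, so the split-brain case remains open.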

Keith