Re: Hot plug vs. reliability
From: Matthias Fouquet-Lapar
Date: Thu May 27 2004 - 10:15:56 EST
> I agree, in this case there is no loss of MTBF.
> Yet let's call this activity as run time re-partitioning of the machine.
> (Most people - me too - consider hot plugging as physically plugging
> things in / out.)
You're right, it's confusing, and I made the same assumption you did:
physically moving parts. (And I worked on a system a couple of years
back where we actually had hotswap :-))
> But the new comers are tested in a different environment, with
> different tolerance range. I just simply do not trust :-)
Not really. It's up to the vendor and at least here at SGI we have pretty
tight rules and tolerances.
> I do not think the timing / the delays are auto adjusting. You select
> a component X to work next to the component Y because you know that
> X in "here" and Y in "there" in the tolerance range...
They do (impedance matching). One example is the SRAMs used for CPUs with
external caches. I've learned a lot about that :-)). You also have things
like auto-learning for echo-clock timings, but this is really very platform
and CPU specific.
> I think the OS has to be platform independent. How can a platform independent
> OS know if <n> errors of this / that type requires what intervention ?
> We'll have the same binary of the OS (+ drivers) for a small desk top or
> for a 32 CPU "main frame". Only the firmware is different...
An OS is never platform independent; there is always a machine-dependent layer.
I'm not really concerned about the total number of errors in a system,
regardless of whether we have one, 32 or 512 CPUs. If we see a component
starting to fail, it should be isolated in order to avoid a catastrophic failure.
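
To make the "isolate it when it starts to fail" idea concrete, here is a
rough sketch of a per-component leaky-bucket error counter. Everything in it
(struct comp_health, comp_health_init, comp_health_record_error, the tick
units and the threshold) is hypothetical and only for illustration; it is not
an existing kernel or firmware interface, and real thresholds would be
platform specific.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-component error tracker (illustration only). */
struct comp_health {
    uint32_t level;       /* current leaky-bucket fill level */
    uint32_t threshold;   /* isolate once level reaches this value */
    uint64_t last_tick;   /* time of the last drain, in arbitrary ticks */
    uint64_t drain_ticks; /* one error is forgiven per this many ticks */
    bool     isolated;    /* set once the component has been fenced off */
};

static void comp_health_init(struct comp_health *h,
                             uint32_t threshold, uint64_t drain_ticks)
{
    h->level = 0;
    h->threshold = threshold;
    h->last_tick = 0;
    h->drain_ticks = drain_ticks;
    h->isolated = false;
}

/*
 * Record one corrected error seen at time `now`.
 * Returns true if the component should be (or already is) isolated.
 */
static bool comp_health_record_error(struct comp_health *h, uint64_t now)
{
    if (h->isolated)
        return true;

    /* Drain the bucket for the time that passed since the last drain. */
    if (h->drain_ticks > 0 && now > h->last_tick) {
        uint64_t drained = (now - h->last_tick) / h->drain_ticks;

        if (drained > 0) {
            h->level = (drained >= h->level) ? 0
                                             : h->level - (uint32_t)drained;
            h->last_tick += drained * h->drain_ticks;
        }
    }

    h->level++;
    if (h->level >= h->threshold)
        h->isolated = true;

    return h->isolated;
}
```

The point of the drain is that a few isolated errors spread over a long time
never trip the threshold, while a burst on one component does, independent of
how many other components the system has.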
> Most of our clients just do not want to touch their 10 year old rubbish
> Fortran programs. If I get a hint of danger (today it does not come from the FW)
> I could take a check point and call for service intervention...
That's a well-known problem (although I think 20 years or more is more
likely ...)
I think, however, that there are new applications coming up using large or
ultra-scale systems where more fault tolerance can be designed in at the OS,
library or even user level.
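
As a minimal sketch of fault tolerance at the user level, an application can
save its own state periodically and restart from the last good copy. The
names below (struct sim_state, checkpoint_save, checkpoint_load) are
hypothetical, and the sketch assumes POSIX rename() semantics, where renaming
over an existing file replaces it atomically.

```c
#include <stdio.h>

/* Hypothetical application state (illustration only). */
struct sim_state {
    unsigned long step;
    double        value;
};

/*
 * Write the state to "<path>.tmp" and then rename it over <path>, so a
 * crash in the middle of a write never corrupts the previous checkpoint.
 * Returns 0 on success, -1 on failure.
 */
static int checkpoint_save(const char *path, const struct sim_state *s)
{
    char tmp[512];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);

    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    FILE *f = fopen(tmp, "wb");
    if (!f)
        return -1;

    int ok = (fwrite(s, sizeof *s, 1, f) == 1);
    if (fclose(f) != 0)
        ok = 0;

    if (!ok) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Read the last checkpoint back. Returns 0 on success, -1 on failure. */
static int checkpoint_load(const char *path, struct sim_state *s)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    int ok = (fread(s, sizeof *s, 1, f) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

A long-running job would call checkpoint_save() every so many steps (or when
it gets a warning from below) and checkpoint_load() at startup. A real
implementation would also want to flush the data to stable storage and to
use a portable on-disk format rather than a raw struct.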
Amicalement
Matthias Fouquet-Lapar Core Platform Software mfl@xxxxxxx VNET 521-8213
Principal Engineer Silicon Graphics Home Office (+33) 1 3047 4127