Re: [RFC/SERIOUS] grilling troubled CPUs for fun and profit?

From: Dave Jones
Date: Mon Jun 19 2006 - 16:22:44 EST


On Mon, Jun 19, 2006 at 04:00:06PM -0400, linux-os (Dick Johnson) wrote:

> > arch/i386/kernel/doublefault.c/doublefault_fn():
> >
> > for (;;) /* nothing */;
> > }
> >
> > Let's assume that we have a less than moderate fan failure that causes
> > the CPU to heat up beyond the critical limit...
> > That might result in - you guessed it - crashes or doublefaults.
> > In which case we enter the corresponding handler and do... what?
>
> The double-fault is just a place-holder. The CPU will actually
> reset without even executing this (try it).

Wrong.

Why do you think we go to the bother of installing a double fault handler if
we're going to reset? Why would we go to the bother of printk'ing
information about the double fault if we're about to reset faster than
it would get to a serial console?

The box intentionally locks up, so we have a chance to know wtf happened.

> A CPU without a fan will go into
> a cold, cold, shutdown, requiring a hardware reset to get it out of
> that latched, no internal clock running, mode.

Wrong.

> Try it. I have had
> broken plastic heat-sink hold-downs let the entire heat-sink fall off
> the CPU. The machine just stops.

Your single datapoint is just that, a single datapoint.
There are a number of reported cases of CPUs frying themselves.
Here's one: http://www.tomshardware.com/2001/09/17/hot_spot/page4.html
Google no doubt has more.

Another anecdote: Upon fan failure, I once had an Athlon MP *completely shatter*
(as in broke in two pieces) under extreme heat.

This _does_ happen.

> Also, the CPU was only warm to the touch, having been completely shut down for the
> several minutes it took to locate tools to remove the cover, even
> though I deliberately left the power ON.

So you got lucky. I've blistered a thumb on hot CPUs before now
after fan failure.

> In the first place, when the Intel and AMD CPUs overheat, they
> shut down.

Reality disagrees with you.

> For sure, it might be nicer to have some call-and-never-return
> function for waiting with the rep-nop code, but it isn't necessary
> for CPU protection.

cpu_relax() and friends aren't going to save a box after a fan failure,
in my experience.
However, for a box which has locked up (intentionally), running
power-saving instructions in a loop has obvious advantages.
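
To make the difference concrete, here is a rough userspace sketch of a
relaxed spin loop next to the bare "for (;;);" quoted above. The names
relax_hint() and spin_relaxed() are made up for illustration and are not
kernel functions. The loop is bounded only so it can be run and checked;
the real handler would spin forever, and a kernel version could use hlt,
which is privileged and so is not shown here. On x86 the hint is the
"rep; nop" (PAUSE) encoding; elsewhere this sketch falls back to a plain
compiler barrier.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the kernel's cpu_relax(). */
static inline void relax_hint(void)
{
#if defined(__i386__) || defined(__x86_64__)
	/* "rep; nop" is the PAUSE instruction: a spin-wait hint to the CPU. */
	__asm__ __volatile__("rep; nop" ::: "memory");
#else
	/* Assumption: no cheap hint available, so only a compiler barrier. */
	__asm__ __volatile__("" ::: "memory");
#endif
}

/*
 * Spin until *flag becomes nonzero or max_spins iterations have passed.
 * Returns the number of iterations actually spent spinning.
 */
static uint64_t spin_relaxed(volatile int *flag, uint64_t max_spins)
{
	uint64_t spins = 0;

	assert(flag != 0);
	while (!*flag && spins < max_spins) {
		relax_hint();	/* instead of an empty loop body */
		spins++;
	}
	return spins;
}
```

With the bound removed and the flag never set, this has the same shape
as the lock-up loop under discussion: the box still hangs so the
messages can be read, it just burns less power while doing so.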

Dave

--
http://www.codemonkey.org.uk