Hard to debug kernel issues (was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning)

From: Chris Samuel
Date: Mon Jan 12 2009 - 02:59:45 EST


On Sun, 11 Jan 2009 11:26:41 pm David Woodhouse wrote:

> Sometimes you weren't going to get a backtrace if something goes wrong
> _anyway_.

Case in point - we've been struggling with some of our SuperMicro based
systems with AMD Barcelona B3 k10h CPUs *turning themselves off* when running
various HPC applications.

Nothing in the kernel logs, nothing in the IPMI controller logs. It's just
like someone has wandered in and held the power button down (and no, it's not
that).

It's been driving us up the wall.

We'd assumed it was a hardware issue as it was happening with all sorts of
kernels but today we tried 2.6.29-rc1 "just in case" and I've not been able to
reproduce the crash (yet) on a node I can crash in about 30 seconds, and
rebooting back into 2.6.28 makes it crash again.

If the test boxes are still alive tomorrow I might see if we can attempt some
form of a reverse bisect to track down what commit fixed it (git doesn't seem
to support that, so we're going to have to invert the good/bad commands).
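
For illustration only, the inverted search can be sketched in Python as below.
This is a rough sketch under stated assumptions, not the procedure actually
used here: `first_fixed`, `crashes` and `bisect_label` are hypothetical names,
the commits are assumed to be listed oldest to newest, the oldest is assumed
to crash and the newest not to, and the fix is assumed to stay in place once
it lands.

```python
def first_fixed(commits, crashes):
    """Return the first commit (oldest-to-newest order) that no longer crashes.

    `crashes` is a hypothetical callable: it builds/boots/tests one commit
    and returns True if the node still powers off.
    """
    lo, hi = 0, len(commits) - 1   # lo: known to crash, hi: known to be fixed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if crashes(commits[mid]):
            lo = mid    # still broken: reported to git bisect as "good"
        else:
            hi = mid    # already fixed: reported to git bisect as "bad"
    return commits[hi]


def bisect_label(crashed):
    """Inverted labels for hunting a fix instead of a regression."""
    return "good" if crashed else "bad"
```

With the labels swapped this way, the "first bad commit" that the bisect
reports would be the commit that introduced the fix.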

cheers,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP
