Re: Random reboots

From: Jean Delvare
Date: Tue Feb 14 2006 - 17:20:05 EST


Hi Ryan,

> > We recently had such an issue with a dual AMD64 machine rebooting at
> > mke2fs. It turned out it was a faulty power supply. After we changed
> > the power supply, everything ran smooth again.
> >
> > You could start to test by powering your drives from an old AT-style
> > power supply leaving more "juice" for the main board and CPUs.
>
> It's possible, but I doubt it. More often than not, the reboot happens
> when the machine is completely idle - in fact I can't remember a single
> time when it wasn't idle. I just spent a couple months debugging a
> SCSI-tape crash, and I ran the backups a lot and had lots of RAID
> resyncs and it *never* rebooted during either of these events. Anyway
> it has quite a large 2+1 redundant power supply, and, like I said, we
> routinely had 3+ months of uptime with older kernels.

You seem to have hardware monitoring drivers loaded on the system, so
I'd suggest that you watch the values they report over time. If the
hardware is going wrong, it might show there. Your system could be
overheating for some reason (a stuck fan, for example).
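
To make "watch the values over time" concrete, here is a minimal
sketch of a logger. It assumes the drivers expose their readings as
*_input files under /sys/class/hwmon/hwmonN/; on some kernels the
files live elsewhere (for example under a device/ subdirectory), so
the default path is an assumption to adjust. The names read_sensors
and log_sensors are made up for this illustration.

```python
import glob
import os
import time


def read_sensors(root="/sys/class/hwmon"):
    """Return {path: integer value} for every readable *_input file."""
    readings = {}
    # Assumed layout: <root>/hwmonN/<name>_input (adjust for your kernel).
    pattern = os.path.join(root, "hwmon*", "*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                readings[path] = int(f.read().strip())
        except (OSError, ValueError):
            # Unwired or failing sensors may not return a number; skip them.
            continue
    return readings


def log_sensors(logfile, interval=60, root="/sys/class/hwmon", samples=None):
    """Append timestamped readings every `interval` seconds.

    Runs forever when `samples` is None, otherwise takes `samples` rounds.
    """
    count = 0
    while samples is None or count < samples:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(logfile, "a") as out:
            for path, value in read_sensors(root).items():
                out.write(f"{stamp} {path} {value}\n")
            out.flush()
            # Force the data to disk so the last samples before an
            # unexpected reboot are not lost in the page cache.
            os.fsync(out.fileno())
        count += 1
        if samples is None or count < samples:
            time.sleep(interval)
```

Something like log_sensors("/var/log/sensors.log") left running in the
background would then give you the last readings before each reboot.
The values are raw driver numbers (temperatures are typically in
millidegrees Celsius), so look for trends rather than absolute values.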

The fact that older kernels were seemingly working better doesn't prove
much. You were running these kernels before, not now, and hardware
*does* age, contrary to what people seem to think. If you want to make
certain that older kernels were indeed working better for purely
software reasons, you should switch back to such an old kernel and see
if things actually improve or not.

A wild guess while I'm at it... Is the machine behind a KVM switch by
any chance? I have a fun (old) motherboard here which reboots when I
unplug the keyboard and plug it back in. I had never seen that before...

> During the years I've had this machine, I've experienced at least 10-15
> strange kernel bugs that only happened on this machine. Each and every
> time I was *convinced* that the hardware was at fault (and people on the
> mailing list suggested it) until either a kernel came out that fixed the
> problem or a kernel developer positively identified it as a kernel
> problem and eventually fixed it. This machine just seems to be a magnet
> for kernel bugs.

Note that the first case ("a kernel came out that fixed the problem")
doesn't mean that the hardware was not at fault. There are quite a few
quirks in the Linux kernel code which are there only to work around
known hardware or BIOS bugs.

--
Jean Delvare