Three kernel Oops/panic/BUG ksymoopses (kernel BUG at buffer.c:539)

From: Erik Bourget
Date: Fri Jan 02 2004 - 18:34:56 EST



(crossposted between linux-kernel and netfilter users because some traces do
go into nf_* and I'm using the conntrack module)

Hello;

I had a very bizarre situation where four boxes in the same rack all
simultaneously (within 30 minutes) hard-locked with Oops messages. The boxes
don't even have the same function - two of them are MXs that face the
Internet, and the other two are spamassassin spamd boxes that receive
messages, put a tag header up top, and send them back to the MX for
filtering. It was during a very (ridiculously) high-load situation where all
four boxes were running at their limit for a few hours.

The kernel running on them was 2.4.23, patched from netfilter's patch-o-matic
to include the ipt_connlimit module. Connlimit was active on the MXs, but not
on the spamd boxes. The .config file is attached.

I captured four panic messages. The results of ksymoops on these are
attached. The one with a 'kernel BUG at buffer.c:539!' was a spamd box, the
other three are from MXs (RESULT-4 is from one of the MXs crashing after this
incident). Sadly, the stack traces don't appear to have anything to do with
each other.

The boxes have since had sporadic crashes, but not in unison like the first
time.

The MXs have been downgraded to 2.4.22 plus the connlimit patch. One of them
has crashed since, but I could not grab the output because it was rebooted via
a remote power box.

Hardware:
MXs: Dell PE 1650, dual P3 1.13GHz, 1024MB RAM, aacraid scsi
controller with 15k rpm drives.
spamd: Dell PE 1750, dual P4 'Xeon' 2.4GHz, 1024MB RAM, Fusion MPT
scsi controller.

The MXs have been functioning flawlessly for a year, as either
nameservers or MXs that don't do spamassassin. The spamd boxes are
new.

Software:
All Debian woody, Linux 2.4.23 with ipt_connlimit as a module (not
loaded on the spamd machines). MXs run qmail and thrash the disk
quite a bit. All filesystems are reiserfs. SMP kernels, no ACPI or
APM etc.

Is there a kernel problem here? If not, does anyone familiar with the chi of
kernel stacktraces have any advice?

Thanks,

- Erik Bourget

Attachment: config-2.4.23
Description: config

Attachment: RESULT-1
Description: Binary data

Attachment: RESULT-2
Description: Binary data

Attachment: RESULT-4
Description: Binary data