Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM

From: Robin Lee Powell
Date: Tue Sep 12 2006 - 18:34:09 EST



Please cc me on replies, as I'm not on the list.

I have some moderately old hosts that hang on boot, very early on,
with any kernel newer than 2.6.3. Important basic facts about the
box are dual opteron 244s, 16gb of RAM, and it's a 64-bit build of
the kernel. We've tried both MK8 and CONFIG_MK8 and
CONFIG_GENERIC_CPU to no avail. Hell, I think I've tried just about
everything at this point.

In most cases the stopping point is:

[snip]
Initializing CPU#0
[snip]
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
CPU 0/0(1) -> Node 0 -> Core 0

(see below for complete version)

Whether that last line shows up or not is based on BIOS configs,
AFAICT.

I've been working on these for several days now, and I'm totally
stumped. I'll generate any data any of you ask for that might be
useful, and I'll be appending a lot of data here.

The kernel I'm actually trying to get working is 2.6.17.11, but I've
tried all of the following:

linux-2.6.2
linux-2.6.3
linux-2.6.4
linux-2.6.6
linux-2.6.10
linux-2.6.17.11

Only the first two work.

For all of them except 2.6.17.11, the config was just the config
from the known-working 2.6.2, make oldconfig, hold down the return
button. 2.6.17.11 I played with a bit more.

The original 2.6.2 working .config is
http://teddyb.org/~rlpowell/media/regular/lkml/orig.config.txt

The current 2.6.17.11 .config is
http://teddyb.org/~rlpowell/media/regular/lkml/current.config.txt

The full boot with apci on is
http://teddyb.org/~rlpowell/media/regular/lkml/apci_boot.txt

The full boot with apci off is
http://teddyb.org/~rlpowell/media/regular/lkml/noapci_boot.txt

This version is rather different, as it ends in:

HARDWARE ERROR
CPU 0: Machine Check Exception: 7 Bank 3: b40000000000083b
RIP 10:<ffffffff80446e3e> {pci_conf1_read+0xbe/0xf0}
TSC 2e7932dbf8 ADDR fdfc000cfc
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Uncorrected machine check

We believe this to be spurious, as both machines of this type we've
tested showed the same error, and both of them have been running
with 2.6.2 for years.

The motherboard is, apparently, a RioWorks Rhapsody HDAMA, whatever that is.
We are not on the latest BIOS revision, but then the changes between 1.84
(which we're on) and 1.89 don't look relevant.

BIOS information is
http://teddyb.org/~rlpowell/media/regular/lkml/bios_info.txt

lspci -v (generated while up on the 2.6.2 kernel that's been running
for years) is
http://teddyb.org/~rlpowell/media/regular/lkml/lspci_v.txt

dmidecode is
http://teddyb.org/~rlpowell/media/regular/lkml/dmidecode.txt

-Robin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/