Re: Same MCE on 4 working machines (was Re: Early boot hang on recent 2.6 kernels (> 2.6.3), on x86-64 with 16gb of RAM)

From: Robin Lee Powell
Date: Fri Sep 15 2006 - 19:19:27 EST


Found a way to get around the large RAM issue; see below.

On Fri, Sep 15, 2006 at 01:31:59PM -0700, wrote:
> On Fri, Sep 15, 2006 at 09:50:39PM +0100, Alan Cox wrote:
> >
> > You also have a lot of RAM, that shouldn't matter but it means
> > you hit code paths most users don't. If you boot with mem
> > limited to 1GB I assume it still blows up ?
>
> I've tried mem=1023M, yes, and it still blows up. Just did
> acpi=off mem=1023M to check.

I've found a server with the same hardware except only 2GiB of RAM.
The behaviour is slightly different. It restarts instead of
hanging, and the last bit is:

Initializing CPU#0
PID hash table entries: 4096 (order: 12, 32768 bytes)
Disabling vsyscall due to use of PM timer
time.c: Using 3.579545 MHz WALL PM GTOD PM timer.
time.c: Detected 1804.115 MHz processor.
Console: colour VGA+ 80x25
Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
Memory: 2059540k/2096576k available (2584k kernel code, 36348k reserved, 1198k data, 220k init)
Calibrating delay using timer specific routine.. 3611.41 BogoMIPS (lpj=18057088)
Security Framework v1.0.0 initialized
Mount-cache hash table entries: 256
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
í

Note the weird character at the end; there's always one or two of
them, but that could just be our Cyclades console servers doing
something odd.

At that point, the machine reboots.

I have found no way yet to get any other behaviour; acpi=off, in
particular, doesn't give me the MCE on this box.

I've tried all your pci= options, too, with no effect.

I tried "nosmp noapic mem=512M ide=nodma apm=off acpi=off desktop
showopts".

I tried iommu=off.

I tried Debian's 2.6.8-11-amd64-generic, which on the 16GiB boxes
went straight to the MCE; on this box it stopped at the same place,
but seems to have hung instead of rebooting. It still didn't get as
far as the MCE.

Nothing seems to make a difference.

But 2.6.2 boots right up, no troubles.

Just to make sure that the machines really were the same, I pulled
lspci -v from this smaller-RAM one. They are *exactly the same*.
Right down to the IRQs. You can see them at:

16gb: http://teddyb.org/~rlpowell/media/regular/lkml/lspci_v.txt

2gb: http://teddyb.org/~rlpowell/media/regular/lkml/devnutch1-lspci_v.txt

(you can see they're not the same file, because the whitespace came
out differently :-)
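
For anyone who wants to repeat that comparison on saved copies of the
two listings, here is a minimal sketch of a whitespace-insensitive
diff. The function names and the "16gb"/"2gb" labels are just
illustrative, and it assumes both files have already been downloaded
locally; it is not the method I actually used.

```python
import difflib
import sys


def normalize(text):
    """Collapse runs of whitespace on each line and drop blank lines."""
    lines = (" ".join(line.split()) for line in text.splitlines())
    return [line for line in lines if line]


def whitespace_insensitive_diff(a_text, b_text):
    """Return unified-diff lines between two texts, ignoring whitespace-only changes."""
    return list(
        difflib.unified_diff(
            normalize(a_text),
            normalize(b_text),
            fromfile="16gb",  # illustrative label only
            tofile="2gb",     # illustrative label only
            lineterm="",
        )
    )


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python lspci_diff.py <first-listing> <second-listing>
    with open(sys.argv[1]) as fa, open(sys.argv[2]) as fb:
        diff = whitespace_insensitive_diff(fa.read(), fb.read())
    print("\n".join(diff) if diff else "identical apart from whitespace")
```

An empty result means the two listings differ only in spacing; any
real hardware difference would show up as +/- lines.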

Here are some BIOS options that look possibly relevant, just in case:

4GB Memory Hole Adjust [Auto]
4GB Memory Hole Size [128 MB]
IOMMU: [Enable]
Size: [64 MB]
Multiprocessor Specification: [1.4]
Use PCI Interrupt Entries in MP Table: [Yes]

ACPI Enabled: [Yes]
ACPI SRAT Table [Enabled]
Spread spectrum modulation [No]
Suppress Unused PCI Slot Clocks [No]

-Robin

--
http://www.digitalkingdom.org/~rlpowell/ *** http://www.lojban.org/
Reason #237 To Learn Lojban: "Homonyms: Their Grate!"
Proud Supporter of the Singularity Institute - http://singinst.org/