Re: PROBLEM: BUG: Constant freezes and kernel panics on a quad core(with dumps)

From: Bruno Barberi Gnecco
Date: Thu Dec 03 2009 - 11:28:22 EST



Regarding the PS, I have checked voltages with a multimeter and they are
more than fine, and the wattage is enough for the system, so it'd have
to be a very weird transient glitch that affects only memory access. See
also below.
Most of the time transients will be the issue when a power supply causes problems and that can't be seen with a normal voltmeter. It's not typical for the rails to be low all the time unless the power supply is heavily overloaded.

Or stone cold dead.

You can't check any PSU with any multimeter I've ever seen unless it's a
catastrophic failure, or as you said, so overloaded that it can't
regulate (in which case it would have shut down if it were decent
quality...). Non-catastrophic PSU failures are often filter problems
that a multimeter isn't fast enough to see. Many switchers are
deplorably noisy, and rely on the caps at the end of the transmission
line, so one poor quality or dried out cap on MB can screw the pooch
too.
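
To put rough numbers on the "multimeter isn't fast enough" point, here is a minimal sketch in Python. It is only an illustration: the 12 V rail, the 50 microsecond dip to 10.5 V, and the 100 ms averaging window are all made-up assumptions, not measurements from this machine or specs of any real meter.

```python
# Illustrative numbers only: none of these values come from the thread.
WINDOW_US = 100_000     # assumed meter averaging window, in microseconds (100 ms)
NOMINAL_V = 12.0        # nominal rail voltage
DIP_V = 10.5            # assumed depth of the transient
DIP_START_US = 40_000   # assumed start of the dip inside the window
DIP_LEN_US = 50         # assumed duration of the dip, in microseconds


def simulate_rail():
    """Return one sample per microsecond for a rail with a single short dip."""
    samples = [NOMINAL_V] * WINDOW_US
    for i in range(DIP_START_US, DIP_START_US + DIP_LEN_US):
        samples[i] = DIP_V
    return samples


def meter_reading(samples):
    """Model a slow meter as a plain average over the whole window."""
    return sum(samples) / len(samples)


def worst_case(samples):
    """What a fast instrument (e.g. a scope) could catch: the lowest sample."""
    return min(samples)


samples = simulate_rail()
print(f"averaged reading:           {meter_reading(samples):.3f} V")
print(f"lowest instantaneous value: {worst_case(samples):.3f} V")
```

With these assumed values the averaged reading stays within a millivolt of 12 V while the rail actually dropped to 10.5 V for a moment, which is the kind of glitch a multimeter would report as "more than fine".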

Any ideas to rule the MB out, other than "get a new one"?

Bad memory (memtest doesn't necessarily access things the same way as
the kernel)
Ruled out. I replaced with a 2GB DDR2, still got the bug: "BUG: Bad page
map in process".

Bad cards (pci, agp, whatever)
Ruled out. The only card is the video card. I replaced it with a very
old PCI board and still got the error. This also pretty much rules out that
the PS is underpowered, since I powered only the MB and the HD.

Could it be one of the onboard things? I disabled everything but the
LAN, and still got it.

Any of the above with loose connections

Pay very close attention to cleanliness. Dust works its way into
connectors with vibration. Pull ram, and reseat. Resist the urge to
clean any connector with anything other than no-residue contact cleaner.

Another thing to watch out for is crappy heat sink compound. That dries
out, doesn't conduct heat well enough. Under load, such a problem may
build VERY fast with modern CPU current draw. If all else fails, pull
your CPU heatsink, clean and re-apply fresh compound.

I already reconnected everything twice. Could still be a loose
connection of one of the wires in the connector, but it's very very
unlikely to give such a specific error on memory access.

And did I mention bad power supply?
Yes you did, and I'll try to get another one to be sure, but it could
still be a software bug too.

Yes, but try another unit. PSU is THE odds-on favorite for random crap
with everything from PC hardware to very high dollar HW. It's the point
of maximum electrical stress. It's also a spot where many people try to
save money... big mistake that.

(removes HW guy hat;)

Follow-up, with thanks to everybody who helped: I tried a different PSU and still got the problem, and I also got a BSOD with Windows. So it seems to be a problem with the motherboard or the processor.

Thanks a lot again,