AMD K6-2/400 and FIC VA-503+BM with Linux 2.0.36

KOEHLEKR@UCRWCU.RWC.UC.EDU
Mon, 22 Feb 1999 8:16:31 -0500 (EST)


To:   Linux Kernel Listserver
      AMD Tech Support
      FIC Tech Support
      R.E. Wolff
      (and anyone else you care to pass this on to whom you
      think might be interested)

From: Kenneth R. Koehler, PhD
      Associate Professor of Physics
      Raymond Walters College
      University of Cincinnati
      (koehlekr@ucrwcu.rwc.uc.edu or kenneth.koehler@uc.edu)

Re: AMD K6-2/400 and FIC VA-503+BM with Linux 2.0.36

I have a hardware problem which I need some help with.

I recently purchased 4 AMD K6-2/400 CPUs and 4 FIC VA-503+BM
MBs with which I am building a Beowulf cluster for parallel
algebraic computations in membrane theory. In testing the
hardware prior to parallelizing my application, I encountered
the following types of problems:

- random seg faults and parsing errors during kernel build
- random silent deaths, illops and seg faults in wmclock
  (always same address when a core file is generated:
  illop = 80493c2, seg fault = 80493c8)
- subtle variation in bogomips from one reboot to the next
  (always 799.54 or 801.18)
- emacs under X seg faults after above problems appear
  (always same address - 401e4811)
- hard system hangs (must reset)

The test bed with which I reproduce these problems is a combination
of a highly CPU-bound C program (133 KB of output in 12.5 hours at 400 MHz,
size > 48 MB, RSS > 24 MB during tests) coupled with a loop of kernel
builds. Changing BIOS parameters (i.e., wait states) had no effect.
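
To separate silent computational errors from outright crashes, a check
along the following lines may be useful: run a deterministic, integer-only
workload repeatedly and compare the checksum of each pass with the first.
This is only a sketch in Python, not the C program used in the test bed;
the names workload and run_passes, the seed, and the loop counts are all
illustrative assumptions.

```python
def workload(rounds: int, seed: int = 12345) -> int:
    """Deterministic integer-only workload; returns a 32-bit checksum."""
    state = seed
    checksum = 0
    for _ in range(rounds):
        # Linear congruential step, truncated to 32 bits.
        state = (state * 1664525 + 1013904223) & 0xFFFFFFFF
        # Rotate the checksum left by one bit, then mix in the new state.
        checksum = ((checksum << 1) | (checksum >> 31)) & 0xFFFFFFFF
        checksum ^= state
    return checksum


def run_passes(passes: int, rounds: int) -> list:
    """Repeat the workload; return indices of passes that differ from pass 0."""
    reference = workload(rounds)
    mismatches = []
    for i in range(1, passes):
        if workload(rounds) != reference:
            mismatches.append(i)
    return mismatches


if __name__ == "__main__":
    # Hypothetical sizes; a real soak test would run far longer.
    bad = run_passes(passes=10, rounds=200_000)
    print("mismatching passes:", bad if bad else "none")
```

On sound hardware every pass should agree, so any mismatch points at the
CPU, cache or memory path rather than at the software. A check of this
kind complements the kernel-build loop, which mostly reveals faults as
seg faults and parsing errors.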

I have run this test bed on all MBs and all CPUs, with both PC-100
memory (8 ns SMT chips) and non-PC-100 memory (Hyundai 10 ns chips).
With the PC-100 memory, failures begin in minutes; with the non-PC-100
memory, in almost 7 hours I had many wmclock silent deaths, 2 wmclock
seg faults and 2 kernel build seg faults. Clearly I have a timing
problem. So I then jumpered a MB for 350 MHz, and as I compose this
message, the test bed has run for over 9.5 hours with no failures.

The motherboards are PCB 1.2, serial numbers LA9004493, 4500, 9812 and
10868, all with 1 MB of L2 cache. The CPUs are K6-2/400AFQ (Linux reports
stepping M), serial numbers 16-007905, 6, 7 and 8. The BIOSes are
Award 1.15JE33S. I am using RedHat 5.2 with kernel 2.0.36. All
configurations have 64 MB of SDRAM (1 DIMM).

I would like to find some way to ascertain which component is
responsible for the failures. I suspect the CPUs, but even with
my 29 years of computing experience, I cannot say that is more than
a feeling. I have run memtest86 for hours with no failures.

Since my application often runs for weeks, and the computations are
too complex ever to be done by hand, it is critical that I get as
much speed as I reliably can from these systems; reliability, however,
is paramount.

If any of you have any ideas or similar experiences, please e-mail me
at the address above.

Thank you for your time and attention.

Ken
