Re: Hardware problem detection? Linux 2.4.3-pre4 lockups.

From: Chris Meadors (clubneon@hereintown.net)
Date: Thu Mar 15 2001 - 12:32:27 EST

Next message: Torrey Hoffman: "RE: Is swap == 2 * RAM a permanent thing?"
Previous message: Matthew Callaway: "Network init script causes 2.2.18 kernel oops (tulip driver)"
In reply to: Nathan Black: "Hardware problem detection? Linux 2.4.3-pre4 lockups."
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, 15 Mar 2001, Nathan Black wrote:

> I am at a total loss, But I have found some interesting anomalies with my
> hardware.

That is about how I was feeling when I had similar problems.

> My Current Setup:
> Supermicro S370DE6 (Serverworks Chipset)
> Dual PIII 866
> 2 x 256 MB PC133 ECC SDRAM
> onboard AIC 7899 SCSI Controller.
> 36G,73GB Seagate Cheetah Drive.
> Voodoo4 4500 AGP video,
> Fore PCA 200e ATM

My setup was (is):
Tyan Thunder 2500 (Serverworks)
Dual PIII 667s
2 x 512 MB PC133 ECC SDRAM
Onboard dual SYM53C896 controller
5 18.2 GB Seagate Cheetahs
nVideo Vanta
Onboard Intel 10/100

> Problem, I have a program the can read a file(large, or small) it will then
> transmit the data over atm, ethernet, localhost,or write to a file.

I could move a lot of network traffic as long as I wasn't hitting the disk
too hard.

> I have noticed that the machine will consistently crash(hard lockup) when I
> do a read loop of the File. It never locks up at the same place, and I have
> changed it so that it never actually does anything with the data after it
> reads. Still, same result.

Any time I pushed the disk subsystem hard I would get a lockup. Sometimes
the kernel would oops, the program writing to the disk would segfault, but
always the machine locked hard.

> Things that have "fixed" the problem. Setting the FSB to 100(jumpered) will
> allow me to run forever.
> Also, Setting the L1 Cache to Write-through instead of write back will allow
> me to run forever at 133, but the performance hit is worse than setting the
> FSB to 100.

If I forced things to run slower I could run longer, like changing the
cache setting, never tried the FSB setting. But even with the machine
slowed down I could eventually lock it up if I pushed the disks hard
enough (12 bonnies at the same time would always do it).

> Another note. When I have attempted to compile the kernel for Uni processor.
> I started getting segmentation faults with gcc.
> Now this tells me it might be the processor. But I have nothing overclocked,
> so I would think that it might be some kind of timing issue in the kernel.

I saw so much strange stuff I couldn't pin it down to one thing, except
perhaps the processor.

> I have two machines set up this way. One is much more stable. But I do
> observe the occasional crash.(hard lockup)

I too had two identical machines. I was doing all my work on one, and was
planning on copying the finished product over to the second when I was
done. After I started suspecting the hardware, I started up the other
machine. It ran perfectly. I could push it as hard as I wanted with no
trouble at all.

> I have also seen fsck crash as well. when the system was not shut down
> correctly. ( as a hard lockup happens very frequently.)
>
> Here are some things that I have tried, but Have not fixed it.
> 1) SMP Kernel with "noapic" at lilo prompt ( and without the noapic)
> 2) Uni Kernel w/ & w/out apic
>
> I am at a total loss.
> Is there anything I can do(other than run at 100 FSB)?
>
> Nathan
>
> P.S. I have enclosed the dmesg output for my Uniprocessor kernel
> <<dmesg.out.uni>>

In the end I started swapping processors between the two machines. I
found the problem followed 1 of my processors. I called Intel and after 2
days of convincing they RMAed my old processor and sent me a replacement.
Both machines have been running perfectly since then. If you have any
more processors I'd try swapping them around. But since you are seeing
problems with 2 similar machines, I wouldn't get my hopes up as to this
being the solution.

-Chris

-- Two penguins were walking on an iceberg. The first penguin said to the second, "you look like you are wearing a tuxedo." The second penguin said, "I might be..." --David Lynch, Twin Peaks

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/

Next message: Torrey Hoffman: "RE: Is swap == 2 * RAM a permanent thing?"
Previous message: Matthew Callaway: "Network init script causes 2.2.18 kernel oops (tulip driver)"
In reply to: Nathan Black: "Hardware problem detection? Linux 2.4.3-pre4 lockups."
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

This archive was generated by hypermail 2b29 : Thu Mar 15 2001 - 21:00:18 EST