Re: 2.2.10 oops (finally, something I can report!)

Walter Reed (walt@itrade.net)
Wed, 30 Jun 1999 16:43:21 -0700


Just because you "normally" don't have problems doesn't mean that you don't
have a hardware problem.

Just a story to relate here: I had a system running 2.0.xx for about 2 years
(pretty much from when 2.0 came out), and it had uptimes of over 200 days.

I installed 2.2.5 (Red Hat 6), recompiled the kernel, and started having
spurious crashes. After tearing my hair out over this stuff for about a week,
it turned out to be bad RAM.

From that day forth, I decided to always make sure I bought good quality
ECC RAM.

FYI, I've also read recently on linux-kernel about someone else who ran
extensive memory tests, but the bad RAM didn't show up until he did the
"recompile the kernel a few times" test. Replacing RAM fixed his problem too.
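
For anyone who wants to automate the "recompile the kernel a few times" test,
here is a minimal sketch. It is my own illustration, not the procedure the
other poster used. The names run_repeatedly and looks_intermittent are made up
for this example, and the build command is whatever you pass on the command
line. The idea is only that a deterministic build should pass or fail the same
way every time, so mixed results across identical runs hint at flaky hardware
rather than a software bug.

```python
import subprocess
import sys


def run_repeatedly(cmd, runs=5, cwd=None):
    """Run the same command `runs` times and collect each exit status."""
    codes = []
    for _ in range(runs):
        # Output is discarded; only the exit status of each pass is kept.
        result = subprocess.run(
            cmd,
            cwd=cwd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        codes.append(result.returncode)
    return codes


def looks_intermittent(codes):
    """True when identical runs disagree: some passed and some failed."""
    passed = any(code == 0 for code in codes)
    failed = any(code != 0 for code in codes)
    return passed and failed


if __name__ == "__main__" and len(sys.argv) > 1:
    # The build command comes from the command line (an assumption of this
    # sketch), so nothing about your source tree is hard-coded here.
    statuses = run_repeatedly(sys.argv[1:], runs=5)
    print("exit statuses:", statuses)
    if looks_intermittent(statuses):
        print("identical runs disagreed: suspect RAM, cache, or CPU")
```

A run that fails every time in the same way points more toward a
configuration or compiler problem. It is the runs that sometimes pass and
sometimes fail that are suspicious, and even then this is a hint, not proof.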

Considering the price of RAM and low-end Pentium chips these days, it's worth
swapping in a real Pentium with a good heat sink/fan and new ECC RAM, and seeing
what happens... If you still have problems, you can always return the parts.

IMHO, "something" about 2.2 stresses the machine a bit more than 2.0, and this
seems to make intermittent hardware problems expose themselves. This is probably a
"good thing."

So I guess the moral of the story is, don't dismiss hardware problems too quickly.
They may surprise you.

On Wed, Jun 30, 1999 at 09:43:09PM +0000, Aaron Lehmann wrote:
> Just a clarification - In my original message to the list I wasn't trying
> to complain specifically about my stability problems, but I had seen a lot
> of Oopsen on the list and I wanted to comment on the general situation of
> 2.2.x stability, including my own experiences. Sorry if it sounded like I
> was complaining about how Linux crashed for me; if it hadn't been for all
> the oopsen I saw on the list I would have suspected a hardware problem.
>
> On Wed, 30 Jun 1999, Linus Torvalds wrote:
>
> > In article <Pine.LNX.4.05.9906300304020.7161-100000@vitelus.com>,
> > Aaron Lehmann <aaronl@vitelus.com> wrote:
> > >This time I had the fortune of an Oops that didn't lock up the machine. I'm
> > >going to apply KMSGDUMP so I can send all future oopses also.
> > >
> > >I hope this helps fix the stability problems:
> > >
> > >Reading Oops report from the terminal
> >
> > Interesting.
> >
> > The oops looks fine. The symbolic information also looks fine: the code
> > in question does in fact look like it is the second instruction in
> > "inet_sendmsg()". Everything basically seems to say that the oops is
> > correctly decoded and caught.
> >
> > The thing that does NOT make sense is the cause of the oops itself,
> > though.
>
> Another kernel hacker pointed this out, but I did not know what it meant.
>
> > The oops happens on
> >
> > c017b651 pushl %ebx
> >
> > and %esp = c3941e80.
> >
> > And quite frankly, there's not a way in h*ll that that instruction could
> > raise the exception in question. But it does.
> >
> > I would _strongly_ suspect one of two things:
> > - bad CPU.
> > - bad cache or RAM timings.
>
> I don't want to troubleshoot a hardware problem on linux-kernel, but I
> strongly suspect that the CPU or ram is not at fault. I have been running
> Linux on this machine ever since September and never changed any BIOS
> settings (except enabling APM monitor blanking) since then. Heat is not a
> problem since the machine is idle most of the time and oopsen usually
> occur at a load level below 0.10, which is where the machine usually
> sits. Running processor-intensive tasks for hours does not seem to
> trigger anything, even on a hot summer day.
