Re: 2.2.10 oops (finally, something I can report!)

Linus Torvalds (torvalds@transmeta.com)
30 Jun 1999 21:02:57 GMT

Next message: Ralf Baechle: "Re: benchmarks and baits"
Previous message: Bret Orsburn: "Re: Large Shared Memory"
Maybe in reply to: Aaron Lehmann: "2.2.10 oops (finally, something I can report!)"

In article <Pine.LNX.4.05.9906300304020.7161-100000@vitelus.com>,
Aaron Lehmann <aaronl@vitelus.com> wrote:
>This time I had the fortune of an Oops that didn't lock up the machine. I'm
>going to apply KMSGDUMP so I can send all future oopses also.
>
>I hope this helps fix the stability problems:
>
>Reading Oops report from the terminal

Interesting.

The oops looks fine. The symbolic information also looks fine: the code
in question does in fact look like it is the second instruction in
"inet_sendmsg()". Everything basically seems to say that the oops is
correctly decoded and caught.

The thing that does NOT make sense is the cause of the oops itself,
though.

The oops happens on

c017b651 pushl %ebx

and %esp = c3941e80.

And quite frankly, there's not a way in h*ll that that instruction could
raise the exception in question. But it does.

I would _strongly_ suspect one of two things:
- bad CPU.
- bad cache or RAM timings.

Basically, the instruction cannot raise that exception with those
inputs. So either the CPU is just doing something randomly wrong due to
internal corruption, OR the CPU gets fed the wrong data at some earlier
point, and when the exception happens and we re-fetch that data, now it
is magically ok again because the timings were better this time.

Or something.

Note that the "bad CPU" thing may have been brought about by the MTRR
changes: maybe Linux sets up some Cyrix CPU state (it was a Cyrix CPU,
right?) incorrectly.

Oh, and do you get the message

Cyrix processor with "coma bug" found, workaround enabled

at bootup? Maybe that workaround does something else bad.

So I would strongly suggest turning off MTRR support and seeing whether
the behaviour is more reliable.

I would also suggest making sure that everything is properly cooled:
overheating can easily result in random problems - corrupting internal
CPU state resulting in basically random behaviour.

Linus

