Re: Meaning of a kernel oops?

Linus Torvalds (torvalds@transmeta.com)
23 Sep 1997 02:45:48 GMT


In article <3426DC22.440C@xs4all.nl>, Anne <annekev@xs4all.nl> wrote:
>
>I recently posted two messages to the list reporting kernel oopses with
>squake under the kernels 2.1.54 and 2.1.55. In these messages I also
>gave the oops and the ksymoops output. I received several emails from
>people telling me that probably this is not a kernel problem and that
>it didn't belong on this list.

Oops messages always belong on the list, so you did the right thing.
Sometimes an oops is nothing to worry about (there have been specific
cases where a program did something bad and the kernel oopsed, but the
oops in itself was not a problem). But even when the oopses are harmless
they should be fixed.

>Now, I was under the impression that an oops was generated when the kernel
>discovered that some of its data structures were broken. Hence, I thought
>that _whenever_ you receive an oops, that this is a symptom of a kernel bug.
>Is this not true?

Mostly. Sometimes the oops is due not so much to a broken data
structure as to a missing test that should have killed the process
anyway. This has happened in the system call return path, especially:
it's not necessarily the _kernel_ data structures that are broken, but
the user structures on the signal stack etc. Killing the program is the
right thing to do in that case, but it is more graceful to do it without
the oops.

>Unable to handle kernel NULL pointer dereference at virtual address
>00000c10
>current->tss.cr3 = 00745000, %cr3 = 00745000
>*pde = 00000000
>Oops: 0002
>CPU: 0
>EIP: 0010:[<c011ccfd>]
>EFLAGS: 00013286
>eax: 00000000 ebx: 0070c063 ecx: 00044000 edx: c073e000
>esi: c001f860 edi: c0101c10 ebp: 00000304 esp: c0079f3c
>ds: 0018 es: 0018 ss: 0018
>Process squake (pid: 371, process nr: 16, stackpage=c0079000)
>Stack: 00001000 c1043000 bffffc58 0000000c 00044000 00000003 00044000 00400000
> c1044000 c0101c10 c011ce83 c1043000 00001000 00000002 00000002 c1006262
> 0000000c 800c5012 c01a1b00 bffffc58 00000007 c009c1e0 00000000 00000003
>Call Trace: [<c1043000>] [<c1044000>] [<c0101c10>] [<c011ce83>] [<c1043000>] [<c1006262>] [<c012a8be>]
> [<c010924a>]
>Code: 89 1c a8 8b 52 44 81 fa 00 60 10 c0 75 e5 8b 44 24 2c 05 00
>
>Using `/System.map' to map addresses to symbols.
>
>>>EIP: c011ccfd <vmalloc_area_pages+1f9/23c>
>Trace: c1043000
>Trace: c1044000
>Trace: c0101c10 <swapper_pg_dir+c10/1000>
>Trace: c011ce83 <vmalloc+3f/5c>
>Trace: c1043000
>Trace: c1006262
>Trace: c012a8be <sys_ioctl+14e/164>
>Trace: c010924a <system_call+3a/40>
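
A minimal sketch of how the "Oops: 0002" line in the report above can be
read, assuming the standard i386 page-fault error code layout (bit 0:
protection violation vs. page not present, bit 1: write vs. read, bit 2:
user vs. kernel mode). The function names are made up for illustration.
The second helper assumes the first three "Code:" bytes (89 1c a8) decode
as a store of %ebx to (%eax,%ebp,4), which would account for the faulting
address:

```python
def decode_page_fault(error_code):
    """Split an i386 page-fault error code into its three low bits."""
    return {
        "cause": "protection violation" if error_code & 0x1 else "page not present",
        "access": "write" if error_code & 0x2 else "read",
        "mode": "user" if error_code & 0x4 else "kernel",
    }


def store_address(base, index, scale):
    """Effective address of an x86 base + index*scale memory operand."""
    return (base + index * scale) & 0xFFFFFFFF


# Values copied from the oops report above.
oops_code = int("0002", 16)
eax, ebp = 0x00000000, 0x00000304

print(decode_page_fault(oops_code))
# {'cause': 'page not present', 'access': 'write', 'mode': 'kernel'}
print(f"{store_address(eax, ebp, 4):08x}")
# 00000c10
```

Under those assumptions the report is a kernel-mode write to a page that
is not present, through a base pointer in %eax that is zero.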

You seem to be using modules, and the stack trace isn't really readable:
several of the addresses don't map to any symbol. Can you recompile your
kernel with the necessary drivers built in rather than as modules and try
again? Even if that makes the problem go away, it's at least a pointer to
the problem, and it would be good to know exactly _which_ module is
broken.
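
To illustrate why some trace entries come out bare, here is a rough
sketch of the address-to-symbol lookup that a tool like ksymoops
performs. The symbol table is not the poster's real System.map: its start
addresses and sizes are reconstructed from the quoted output (start =
address - offset), and the names SYMBOLS, resolve and annotate are
hypothetical.

```python
import bisect

# (start, size, name), reconstructed from the ksymoops lines in the report.
# A real System.map lists only start addresses; sizes are implied by the
# next symbol. This table is an illustrative stand-in.
SYMBOLS = sorted([
    (0xC0101000, 0x1000, "swapper_pg_dir"),
    (0xC0109210, 0x40, "system_call"),
    (0xC011CB04, 0x23C, "vmalloc_area_pages"),
    (0xC011CE44, 0x5C, "vmalloc"),
    (0xC012A770, 0x164, "sys_ioctl"),
])
STARTS = [start for start, _, _ in SYMBOLS]


def resolve(addr):
    """Return 'name+offset/size' for addr, or None if no symbol covers it."""
    i = bisect.bisect_right(STARTS, addr) - 1  # nearest symbol at or below addr
    if i < 0:
        return None
    start, size, name = SYMBOLS[i]
    offset = addr - start
    if offset >= size:
        return None  # past the end of that symbol: unknown territory
    return f"{name}+{offset:x}/{size:x}"


def annotate(trace):
    """Format trace addresses the way the quoted output does."""
    lines = []
    for addr in trace:
        sym = resolve(addr)
        lines.append(f"Trace: {addr:08x}" + (f" <{sym}>" if sym else ""))
    return lines


# Call trace addresses copied from the oops report.
TRACE = [0xC1043000, 0xC1044000, 0xC0101C10, 0xC011CE83,
         0xC1043000, 0xC1006262, 0xC012A8BE, 0xC010924A]

for line in annotate(TRACE):
    print(line)
```

Addresses such as c1006262 fall outside every symbol the map knows about,
so they are printed without a name. With everything compiled into the
kernel image, code addresses would land inside the map and resolve.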

Linus