Oops on my 486, kernel 2.2.13.

Brent M. Smith (brent@calwestmu.com)
Tue, 16 Nov 1999 21:05:38 -0800


Ok, after about two days of investigating, I believe I've found the source of
my problems. Let me start out at the beginning, just for clarity.

* Installed the latest version of Slackware, 7.0, on an old 486 box.
* Recompiled the kernel, but accidentally took out support for Unix domain
sockets.
* Rebooted, kernel boots up, and then hangs for about a minute after the INIT
message pops up (INIT: version 2.76 starting). Then I see an OOPS fly by,
which I cannot read, because the machine reboots immediately. I must find a
way to read the OOPS!
* I post a message here, and find where to get the kernel patch for IKD. I
patch the kernel with the debugger, and buy a null modem cable from my local
computer store. I also recompiled the kernel with serial console.
* After getting serial console to work, I boot up the kernel and finally get
the oops to appear. I try to run it through ksymoops, but the "Code" line is
missing so it doesn't work. I look in System.map, and resolve the EIP to
being the "sock_close" function + 27 bytes.
* Not knowing what to do, I take a look at the socket code. Nothing weird
going on in sock_close. (I even recompiled the kernel with a check for a
returned NULL pointer from socki_lookup). Rebooted, and still got the oops.
* When I rebooted, I used the ps utility in IKD to check out processes. I
noted that there were 200 processes, and almost all of them were "sh" and
"modprobe". UH OH! Module loading problem. However, I distinctly recalled
that I had made my kernel strictly monolithic. You may be able to guess my
mistake.
* Stupidly, I had forgotten to delete /lib/modules/2.2.13/, before rebooting.
So the old slackware modules, that came with the generic slackware kernel,
were still there. (And were probably compiled with symbol versioning).
Anyway, I figured that the newly compiled kernel was trying to load the old
modules, and when it loaded them it Oopsed for some reason. (I have no idea
what exactly was going on here)
* So I removed /lib/modules/2.2.13/, and the kernel boots up fine, until it
starts repeating "can't locate module net-pf-1". I didn't know what this was
at the time, so I grepped /usr/include, and the kernel source until I found
out that it was the PF_LOCAL, or PF_UNIX family. DOH!! Stupid, Stupid error.
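
For anyone who ends up resolving an EIP against System.map by hand as in the
ksymoops step above, here is a rough sketch of the lookup in Python. The
addresses in SAMPLE_MAP are made up for illustration and are not from my
kernel, and the helper names load_symbols and resolve are hypothetical. It
assumes the usual three-column "address type name" layout of System.map.

```python
import bisect

def load_symbols(lines):
    """Parse System.map-style lines ("address type name") into a sorted list."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        address, _kind, name = parts
        symbols.append((int(address, 16), name))
    symbols.sort()
    return symbols

def resolve(eip, symbols):
    """Return (name, offset) for the last symbol at or below eip."""
    addresses = [address for address, _name in symbols]
    index = bisect.bisect_right(addresses, eip) - 1
    if index < 0:
        raise ValueError("address is below the first symbol")
    address, name = symbols[index]
    return name, eip - address

# Made-up excerpt; real addresses come from the System.map that matches
# the kernel that produced the oops.
SAMPLE_MAP = """\
c0148e10 T sock_release
c0149a40 t sock_close
c0149a90 t sock_fasync
"""

symbols = load_symbols(SAMPLE_MAP.splitlines())
name, offset = resolve(0xc0149a40 + 27, symbols)
print(f"{name}+{offset}")  # prints "sock_close+27"
```

The lookup only gives a sensible answer when System.map belongs to the exact
kernel image that oopsed; a map from a different build will silently point at
the wrong function.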

Anyway, I'm sorry to write so much, but I felt a need to let people know this
might happen. Maybe we can put a little printk in socket.c if (family ==
PF_UNIX (1)), and say something like, "You forgot to put alias net-pf-1 in
modules.conf, or you forgot to compile with Unix domain sockets." I spent
nearly 8 hours or so debugging this, so maybe it would save time in the future.
Or maybe not. Or maybe we can find out why it was oopsing trying to load the
other (old) module.
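
Until something like that printk exists, the number in a "net-pf-N" message
can be decoded from userspace instead of grepping headers. The sketch below is
only an illustration, not a kernel patch: the function names
decode_net_pf_alias and hint_for are hypothetical, and it relies on the
assumption that Python's socket module exposes the AF_* constants of the
platform it runs on (on Linux the AF_* and PF_* values are the same).

```python
import socket

def decode_net_pf_alias(alias):
    """Split "net-pf-N" into N and the AF_* names Python knows for that number."""
    prefix = "net-pf-"
    if not alias.startswith(prefix):
        raise ValueError(f"not a net-pf alias: {alias!r}")
    number = int(alias[len(prefix):])
    names = sorted(
        name for name in dir(socket)
        if name.startswith("AF_") and getattr(socket, name) == number
    )
    return number, names

def hint_for(alias):
    """Build a message in the spirit of the printk suggested above."""
    number, names = decode_net_pf_alias(alias)
    if "AF_UNIX" in names:
        return ("net-pf-%d is PF_UNIX: add an alias for it in modules.conf, "
                "or compile the kernel with Unix domain sockets." % number)
    return "net-pf-%d is %s" % (
        number, " / ".join(names) or "unknown to this Python build")

print(hint_for("net-pf-1"))
```

This only reports what the Python build on the current machine knows, so
families it does not define come back as unknown rather than being guessed.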

I'd be happy to submit a patch for the PF_UNIX check, if anyone cares.
Thanks for listening.

-- 
    Brent M. Smith, <brent@calwestmu.com>
