2.2.13pre{4-14} UP hangs

Paul Cassella (pwc@andrew.cmu.edu)
Thu, 30 Sep 1999 18:58:53 -0400 (EDT)


This message is in MIME format. The first part should be readable text,
while the remaining parts are likely unreadable without MIME-aware tools.
Send mail to mime@docserver.cac.washington.edu for more info.

--41954960-1325920753-938732333=:761
Content-Type: TEXT/PLAIN; CHARSET=US-ASCII
Content-ID: <Pine.LNX.3.96L.990930161754.761C@manetheren.res.cmu.edu>

I've been having fairly random hangs every two or three days. Typically,
it will involve the following:

- Keyboard led's, alt-sysrq[1], ctrl-alt-del non-responsive; the system
doesn't answer pings.

- If an mp3 is playing, a single tone will continue until the machine is
turned off.

- If a cd is playing, it will continue playing.

- This being a new Dell, the power-off button is apparantly partially
software controlled. What seems to me to be significant about this is
that it normally turns the machine off when it's first pressed. When
the machine is hung, though, it needs to be held in for a few seconds.

- The mouse cursor in X freezes.

Three times my uptime has been under an hour before hanging. It's usually
over a day, and sometimes nearer three days.

I've run 2.2.13pre{4,6,10,11,14} and had the same result. I'd tried the
2.2.12 ikd patch from e-mind for the pre10 kernel, and Andrea's ikd5 patch
from suse.com on pre11 and 14, all with the MEMLEAK_ALLOC_NOLOCK after the
line after area->count-- in page_alloc.c.

They've all been compiled with egcs 2.91.66 19990314 from redhat's
egcs-1.1.2-24.rpm

The pre10 kernel with ikd, with the semaphore and softlock checks on, hung
several times. For the pre11 and 14 kernels, I'd turned off
CONFIG_DEBUG_SOFTLOCK, as per Andrea's message of Sept 23. None of these
produced any additional output, although that may be because they all
occurred with the console in X.

X has been running for every hang. It's happened with X-S3 and X-SVGA,
starting with whatever RH 6.0 shipped with, through the rawhide 3.3.5-2
rpms. Almost all of the hangs have occured while I was at the console (X)
and playing mp3s, although there have been a few times when it had frozen
while relatively idle, or just a cd playing. I'm not running an rc5 or
seti client, or any long-running cpu-intensive stuff.

Nothing else is as obvious. None of the smbfs, nfs, ide-cd, or cdrom
modules have been loaded during all the crashes (although each has been
loaded for at least one), and there have been crashes when none of them
has been loaded since booting. The only other modules I use commonly are
tulip, es1371, and soundcore, which are almost always loaded.

The only two hangs which I noticed being different were:

After running multiple tar -zxf and diff's, and compiling kernels -j16 for
a while, and having an uptime of a few days without ever have started X, I
started X, and switched back to console after the enlightenment startup
started. I started typing "mount -o remount,ro" when it hung. No
messages appeared. This was the only hang I've had in console mode.
Also, no sound was playing, although if I'd figured it out (explained
later) by then, the modules were probably loaded since it's a redhat box.
(Hmm. Does enlightenment/gnome start esd automatically?)

The other was instead of the sound staying at a single tone, it kind of
tapered off and stopped. I think netscape had stopped responding to the
mouse a second or two before the mouse stopped responding, but I could
have imagined that. When I pushed the power button, it turned off
immediately. Since I hadn't thought the hang was that different until
this point, I didn't try pinging it or playing with the keyboard. This
doesn't seem like the same crash at all; more like userspace not getting
scheduled rather than the kernel sitting somewhere with interrupts off.

I don't remember which kernels these were, except that neither was with an
ikd kernel.

The first couple hangs were with the demo OSS drivers, because I didn't
realize that the in-kernel es1371 driver would run my soundcard. The rest
have been with es1371 from the kernel tree. I don't know if it's ever
crashed without either module having been loaded.

The hangs have occured with the tulip.c included in the kernel (0.89H),
the latest stable one from Don Becker's page (0.91), and the tulip.c
provided by Netgear (based on 0.89K, apparantly), all as modules.

I don't think it's hardware in the sense of bad memory or cpu because it
did the gunzip/tar/diff and make zImage -j 16 without problems, and
nothing besides netscape is dying randomly. I haven't seen (noticed) any
data corruption.

Something I'm inclined to ignore because it seems irrelevant is that,
according to my /var/log/messages, a fair amount of the hangs happend
within an hour of a "modprobe: can't locate module sound-slot-1", although
I suspect that this is more likely to just mean within an hour of me using
the soundcard. (Two hangs also happened less than an hour after loading
the OSS module)

My current .config, the "interesting" parts of /proc, and the output of
dmesg and lspci -vv are at
http://www.contrib.andrew.cmu.edu/~pwc/things.tar
This .config is mostly the same one I've been using all along.

The short version:

UP Celeron 466, 128M (needs mem=127m at boot)
Not sure about the motherboard, but it's got an
onboard intel 810 AGP 4m video card (allegedly disabled in the bios, but
still visible in lspci)
STB Trio64V+ 2m (pci) (This is the one the monitor's plugged into)
Netgear FA 310 TX PCI ethernet card (tulip based)
SoundBlaster 64V PCI
A pair of IDE harddrives and an IDE cdrom, one hd on ide0, the cdrom and
other hd on ide1.
RH 6.0-based, with glibc-2.1.2 recently installed, FWIW.
X-3.3.5, SVGA and S3 servers.
There are no ISA slots, but lspci shows an ISA bridge.

I could probably get a serial cable and use my roommate's box as console
if anyone thinks it would help.

If you want more information or want me to try something, let me know.

Unrelated:

Starting with pre14, (I didn't try 12 or 13), I've got unresolved symbols:
ide-cd : ide_[lots], atapi_{input,output}_bytes, more
nbd, rd, floppy, loop : end_that_request_{first,last}

Only {un,}register_cdrom and cdrom_fops in ide-cd are modversion'd.

These are all the modules in /lib/modules-.../block/, which may be
significant.

-- 
Paul Cassella    ::  Doctor Who: He's back, and it's About Time.
fortytwo@cmu.edu ::  http://www.contrib.andrew.cmu.edu/~pwc

--41954960-1325920753-938732333=:761--

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/