Re: [SMP patch] io-apic-patch-2.1.97-A

Robert HYATT (hyatt@cis.uab.edu)
Sun, 19 Apr 1998 12:28:32 -0500 (CDT)


On Sun, 19 Apr 1998, Bill Broadhurst wrote:

>
> 1. Processes mysteriously die during a long (more than 2 hour)
> compile. The dead process can't be killed and shows 'D' in
> ps. If that process happens to be linked to a device, that
> device is continually 'busy' and only a reboot can free it.
> Any other process that touches the 'busy' device also dies.
> If this device is the root file system, the system processes
> will also (eventually) die, leaving the system in a "hung"
> state. No net access is possible and the Magic SysRq keys
> work (sorta) most of the time. <Magic-boot usually does.>
> This happens 100% of the time during a long modeling program or
> when re-compiling the entire X package.
>
> This symptom started at 2.1.85 and continues. It happens sooner on
> some kernels and later on others. It happens on all my systems. I'll list
> them later.
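
For anyone wanting to catch the stuck 'D' processes before the box goes
down, here is a minimal sketch that lists them by reading /proc directly
rather than going through ps. It assumes a mounted /proc with the usual
"pid (comm) state ..." layout in each stat file; the function names
parse_stat and uninterruptible_processes are made up for illustration.

```python
import os


def parse_stat(stat_text):
    """Split the start of a /proc/<pid>/stat line into (pid, comm, state).

    The command name sits in parentheses and may itself contain spaces or
    parentheses, so we cut at the first "(" and the last ")".
    """
    lpar = stat_text.index("(")
    rpar = stat_text.rindex(")")
    pid = int(stat_text[:lpar])
    comm = stat_text[lpar + 1:rpar]
    state = stat_text[rpar + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """Return a sorted list of (pid, comm) for processes in state 'D'."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the line was not in the assumed form
        if state == "D":
            stuck.append((pid, comm))
    return sorted(stuck)
```

Calling uninterruptible_processes() from a cron job or a loop during the
long compile would at least show which process wedges first.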

you should check the kerneld logfile for any sort of message like
"eth0: TX timed out..." I get this and my machine hangs, but not "hard".
No net traffic, can't start new processes nor exit old ones, but some
things "sorta" work for a bit...
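
A rough sketch of that log check, in case it saves someone from eyeballing
the file. The regular expression is an assumption: it matches any line
naming an ethN interface followed by "timed out", since the rest of the
driver's message is not quoted above. The function name find_tx_timeouts
is made up for illustration, and where the messages end up depends on
your syslog setup.

```python
import re

# Assumed pattern: "<ethN>: ... timed out", case-insensitive.
TX_TIMEOUT = re.compile(r"\b(eth\d+):.*\btimed out", re.IGNORECASE)


def find_tx_timeouts(lines):
    """Return (interface, line) pairs for every line that looks like a
    transmit timeout report from an ethernet driver."""
    hits = []
    for line in lines:
        match = TX_TIMEOUT.search(line)
        if match:
            hits.append((match.group(1), line.rstrip("\n")))
    return hits
```

Pass it an open log file (any iterable of lines works) and print whatever
comes back; an empty list means no timeouts were logged in that file.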

I've tracked this to high ethernet traffic blowing out the ethpro100
driver, getting it into a state from which it can't recover. It's been
there a *long* time, but got *really* bad in 96-pre1 and 96, although I
can't try 97 until tomorrow sometime... unless I build and boot from
home tonite (I hate doing this because if the boot hangs, it's a 20+
mile drive in to my office to unhang it..)

>
> 2. On 2.1.96/7 the system will hang *hard* during a tape backup to
> the SCSI tapes. This also happens 100% of the time on the units with
> tape drives. (It also doesn't matter whether the drive being backed up
> is local or on another machine on the net.) All tapes are on
> BusLogic BT930 controllers of various vintages. I did move one to
> my remaining Adaptec controller, and it does the same thing, only much
> further into the tape.
>

SCSI has a definite problem in 2.1.96. I found I could not copy a large
file from one SCSI drive to another (large=200-500mb) without the machine
hanging *hard*. I backed up to 2.1.85 (the only older kernel I happened
to have saved in a handy place) and the SCSI copies went perfectly with
no problems at all. So something is "up" in 96 for certain, at least with
the combination of the bt958 SCSI and etherexpress Pro 100 ethernet
card. Sendmail would forward me a few email messages and then hang. I'd
try to ftp a 20mb file from one machine to my quad processor (on a
switched hub which provides good thruput) and it would hang... And then
I found I couldn't even copy files from one drive to another reliably...

> As noted, this symptom began, (I think) at 2.1.96 but I'll have to
> verify this as most backups have been made on a non-intel machine over
> the net since 2.1.88. I did a restore of about 200M of files on this
> machine under 2.1.95 without incident.
>

probably correct. 2.1.95 only had an occasional ethernet hang for me
with high traffic. 2.1.96 seemed to have problems with ethernet and
SCSI I/O (note these are wide SCSI devices [ultra] that are being tagged
as 40mbyte/sec devices).

> This machine:
> Tyan Tomcat IV with dual p200's, 128M dram 512K cache.
> BT930 SCSI with 4 disks, tape, and CDROM
> ATI Mach64 PCI video card.
> 3COM 3c595 PCI NIC.
> 3COM 3c509B ISA NIC.
> SB16 sound card.
>
> Other Linux machine:
> Tyan Tomcat III with dual P166's, 64M DRAM, 512K cache.
> AHA-2940 SCSI with 2 disks, and CDROM.
> ATI Mach 64 PCI video.
> 3COM 3c509B NIC
> SB16 sound card.
>
> Last Linux machine:
> ASUS MB with 64M DRAM dual P166's
> BT930 SCSI with 1 disk, tape, and 1 CDROM
> TSENG labs PCI video card.
> Intel PCI nic.
>
> There are others scattered over the area but these are right here.
> All others have various hardware but most are running 2.1.83 as they
> all show this (#1 above) problem.
>
>
> >
> > it might be that i have accidentally overlooked some bug report, please
> > resend in such cases, there were quite many SMP fixes in recent kernels and
> > 2.1.97+this_patch is supposed to work in 100% of the cases.
>
> Alas, it does not.
>
> I'll be happy to provide any other info you require. I've been keeping
> Linus up to date with the initial problem. Haven't told him about #2 yet
> as it just came to light last night. I'll send him a copy of this.
>
> -bb
>
> --
> ----------------------------------------------------------------------
> Dr. Bill Broadhurst | Independent contract Engineer.
> (619)296-3710 | BIOS, Firmware, & Diagnostics.
> bbroad@CX492564-a.dt1.sdca.home.com | Finger for PGP 5.0 public key.
> ----------------------------------------------------------------------
>
