2.0.37pre SMP + AIC-7895 (v5.1.11) = reproducible freeze

Wilf (G.Wilford@ee.surrey.ac.uk)
Wed, 17 Mar 1999 10:44:46 +0100


I've been suffering from mysterious system freezes ever since my dual
PII-400 went into production service last year. I tried various
2.0.36pre kernels, irq-reentry patches, the latest aic7xxx drivers and
other advice, and the interval between freezes went from a few hours
to a few days to a few weeks.

The load on this server has been steadily increasing over the months
and the time-to-freeze started to come back down again. By yesterday,
it was freezing several times per day again.

I took it out of service for an evening and gave it a full workout:

SuperMicro P2DBS Dual PII 440BX (onboard dual channel AIC-7895 u/w SCSI)
Dual PII-400, Intel EtherExpress Pro 10/100, DEFPA FDDI, 2x WDE9100AV
linux-2.0.37pre7 (prepatched with aic7xxx v5.1.11)

The workout consisted of a full CPU/memory/IO/network soak, generating
a sustained load average of ~20. Here are the results:

kernel            SCSI                             outcome
2.0.36pre22 SMP   aic-7895 v5.1.4  (channel A+B)   freeze < 10 mins
2.0.37pre7 SMP    aic-7895 v5.1.11 (channel A+B)   freeze < 10 mins
2.0.37pre7 SMP    aic-7895 v5.1.11 (channel A)     freeze < 10 mins
2.0.37pre7 UP     aic-7895 v5.1.11 (channel A)     no freeze
2.0.37pre7 SMP    ncr53c875 v3.1e                  no freeze
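
The pattern in these results can be checked mechanically. What follows is only a minimal sketch in Python, assuming the five rows are transcribed by hand; the tuple layout and the function names are illustrative and do not come from the kernel or either driver. It groups the runs by (SMP/UP, controller) and reports which combinations froze.

```python
# Soak-test results transcribed by hand from the table above.
# Tuple layout (illustrative): (kernel, mode, controller, driver_version, froze)
RESULTS = [
    ("2.0.36pre22", "SMP", "aic-7895", "v5.1.4", True),
    ("2.0.37pre7", "SMP", "aic-7895", "v5.1.11", True),
    ("2.0.37pre7", "SMP", "aic-7895", "v5.1.11", True),
    ("2.0.37pre7", "UP", "aic-7895", "v5.1.11", False),
    ("2.0.37pre7", "SMP", "ncr53c875", "v3.1e", False),
]


def combinations(results, froze):
    """Return the set of (mode, controller) pairs whose runs match `froze`."""
    return {(mode, ctrl) for _, mode, ctrl, _, f in results if f == froze}


frozen = combinations(RESULTS, froze=True)
stable = combinations(RESULTS, froze=False)

print("froze: ", sorted(frozen))
print("stable:", sorted(stable))
```

Only the SMP + aic-7895 pairing appears in the frozen set. Neither SMP alone nor the aic-7895 alone is enough to reproduce the lockup in these five runs, which is a small sample and says nothing about where the fault actually lies.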

The freezes were complete system lockups, with the SCSI activity light
and one of the disk access lights left on full...

In the final test with the borrowed ncr53c875, we sustained a load
average of around 35 for an hour. It did stumble once, but recovered
gracefully:

Internal error: bad swap-device
Trying to free nonexistent swap-page
Internal error: bad swap-device
Trying to free nonexistent swap-page
ncr53c875-0:0: SIR 16, incorrect nexus identification on reselection
scsi : aborting command due to timeout : pid 227560, scsi0, channel 0, id 0, lun 0 Read (6) 08 5d fa 08 00
ncr53c8xx_abort: pid=227560 serial_number=227574 serial_number_at_timeout=227574
SCSI host 0 abort (pid 227560) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
ncr53c8xx_reset: pid=227560 reset_flags=2 serial_number=227574 serial_number_at_timeout=227574
ncr53c875-0: resetting, command processing suspended for 2 seconds
ncr53c875-0: restart (scsi reset).
ncr53c875-0: enabling clock multiplier
ncr53c875-0: Downloading SCSI SCRIPTS.
ncr53c875-0: command processing resumed
ncr53c875-0-<0,*>: WIDE SCSI (16 bit) enabled.
ncr53c875-0-<1,*>: WIDE SCSI (16 bit) enabled.
ncr53c875-0-<0,*>: FAST-20 WIDE SCSI 40.0 MB/s (50 ns, offset 15)
ncr53c875-0-<1,*>: FAST-20 WIDE SCSI 40.0 MB/s (50 ns, offset 15)
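
As a sanity check on the last two log lines, the negotiated rate follows from the numbers the driver prints: a 50 ns synchronous period on a 16-bit bus. A small sketch of that arithmetic in Python (the function name is illustrative, not a driver interface):

```python
def sync_rate_mb_per_s(period_ns, bus_width_bits):
    """Peak synchronous SCSI transfer rate in MB/s (1 MB = 10**6 bytes)."""
    transfers_per_second = 1e9 / period_ns   # 50 ns period -> 20 million transfers/s
    bytes_per_transfer = bus_width_bits // 8  # 16-bit wide bus -> 2 bytes per transfer
    return transfers_per_second * bytes_per_transfer / 1e6


# FAST-20 (50 ns period) on a WIDE (16-bit) bus, as reported in the log
print(sync_rate_mb_per_s(50, 16))
```

This gives the 40.0 MB/s figure shown for both targets, so the disks renegotiated full FAST-20 WIDE speed after the bus reset.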

I don't know whether the onboard AIC-7895 is flaky, the driver is
flaky, or something else is causing the hardware/driver to enter an
unrecoverable state. Anyway, using the ncr53c875 I now have what
appears to be a stable SMP system at last.

Cheers,
Wilf.
