SCSI problems 2.0.32 and recent 2.1.xx kernels

Linux Kernel - Mike Tubby (mike@thorcom.com)
Mon, 15 Dec 1997 15:02:21 +0000


At 14:04 15/12/97 +0000, alan@lxorguk.ukuu.org.uk (Alan Cox) wrote:
>
>> I have a half dozen boxes with 2940's in them. none of them have
>> experienced scsi related problems under 2.0.32...
>>
>
>The fact you havent doesnt mean nobody has alas..
>

Hmmm... 2.0.32 (also > 2.1.66) and SCSI - Arrrggghhhh !!!!!!

I have the following set up:

Hardware:

Intel Providence (PR440FX) motherboard, which includes:
- onboard Adaptec AIC7xxx in standard/wide mode
- Intel EtherExpress 10/100
- sound
2 x GenuineIntel P200-Pro 200MHz, 256K cache, step 9 CPU
2 x 3Com 3C900 Boomerang PCI ethernets
2 x 64MB ECC DIMM = 128MB main memory
A disk array cabinet (*not* RAID) with the following drives
in it:
1 x Fujitsu 1GB SCSI
1 x HP C3726SA 2GB SCSI
2 x Micropolis Tomahawk 9GB SCSI
1 x SONY DAT 8GB
1 x Philips CD-2000 writer
1 x Hitachi 2xSpeed CD reader
all are single-ended narrow (8-bit) bus.

Software:

Red Hat 5.0, Linux 2.0.32 and/or 2.1.66 or 2.1.72
GCC 2.7.2.3 et al. from RH 5.0

Environment:

Mixed Linux, Novell, Windows-NT, Windows-95 all accessing the
machine at the same time...

I have SCSI problems with 2.0.32 and recent 2.1.xx kernels! There
are two distinct problems:

A. Tagged command queueing/SCBs/disconnection ???
-------------------------------------------------

The first problem fits the symptoms that Dave describes below; under
heavy load I get the same problems:

>Date: Wed, 3 Dec 1997 23:41:13 -0500 (EST)
>From: Dave Andruczyk <dave@www.buffalostate.edu>
>To: linux-smp@vger.rutgers.edu
>Subject: SCSI problems.
>
>Ever since upgrading to 2.0.32, I noticed the AHA788X scsi driver is new.
>Since then I have lost the Same drive TWICE due to heavy I/O load. ( doing
>three copies of multiple megabyte files to that drive. ( I have verified
>this is NOT a drive failure..)
>
>The drive was complete blown apart, and even fsck couldn't fix it. (
>TWICE!!!!) :(
>I can make it happen over and over.. just by starting severl cp's on the
>system.
>
>Tagged Command Queueing was ENABLED
>Override driver defaults per LUN was disabled
>Enable SCB paging was ENABLED
>Collect Stats in /Proc was ENABLED
>
>The kernel complains about:
>Dec 3 22:19:38 scarlet kernel: scsi : aborting command due to timeout :
>pid 2540326, scsi0, channel 0, id 6, lun 0 Read (6) 00 00 8b 0d 00
>Dec 3 22:19:38 scarlet kernel: (scsi0:6:0) Abort_reset, scb flags 0x1,
>while idle, LASTPHASE = 0x1, SCSISIGI 0xe6, SEQADDR 0x7, SSTAT0 0x27,
>SSTAT1 0xb
>Dec 3 22:19:38 scarlet kernel: (scsi0:6:0) Queueing an Abort SCB.
>Dec 3 22:19:38 scarlet kernel: (scsi0:6:0): Abort message sent.
>Dec 3 22:19:38 scarlet kernel: (scsi0:6:0) SCB 14 abort completed.
>Dec 3 22:19:38 scarlet kernel: (scsi0:6:0) Reset device, active_scb 6
>Dec 3 22:19:38 scarlet kernel: scsi0: (targ 6/chan A) matching scb to
>(targ 0/chan A)
>Dec 3 22:19:38 scarlet kernel: scsi0: (targ 6/chan A) matching scb to
>(targ 5/chan A)
>Dec 3 22:19:38 scarlet last message repeated 6 times
>

Only difference is that I did not have "gather stats for /proc" enabled.
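
To see which targets the timeouts actually hit, the aborts in a log like
the one quoted above can be tallied with a short script. This is only a
rough sketch: it assumes each kernel message sits on a single line (the
quote above has been wrapped by the mailer), and the name
`count_timeouts` is made up for illustration.

```python
import re
from collections import Counter

# Matches the SCSI timeout message quoted above, e.g.
# "scsi : aborting command due to timeout : pid 2540326, scsi0, channel 0, id 6, lun 0 ..."
# Assumption: the message has been unwrapped onto one line.
TIMEOUT_RE = re.compile(
    r"aborting command due to timeout\s*:\s*"
    r"pid \d+, scsi(?P<host>\d+), channel (?P<channel>\d+), "
    r"id (?P<id>\d+), lun (?P<lun>\d+)"
)


def count_timeouts(lines):
    """Count timeout aborts per (host, channel, id, lun) tuple."""
    counts = Counter()
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            key = tuple(int(m.group(g)) for g in ("host", "channel", "id", "lun"))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Two lines taken from the quoted log, the first one unwrapped.
    sample = [
        "Dec 3 22:19:38 scarlet kernel: scsi : aborting command due to timeout : "
        "pid 2540326, scsi0, channel 0, id 6, lun 0 Read (6) 00 00 8b 0d 00",
        "Dec 3 22:19:38 scarlet kernel: (scsi0:6:0) Queueing an Abort SCB.",
    ]
    for (host, channel, target, lun), n in sorted(count_timeouts(sample).items()):
        print(f"scsi{host} channel {channel} id {target} lun {lun}: {n} timeout(s)")
```

If one id dominates the tally, that points at a single drive (or its
firmware); timeouts spread across several ids look more like a bus or
driver problem.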

I have had suggestions that it's:

1. a SCSI bus termination problem
2. a drive disconnection problem with the new kernel driver
3. buggy firmware in drives

Now, I've been into the BIOS and turned off SCSI disconnection. I've
re-built the kernel with command queuing depth of 2 and 1 (effectively
off ???) and fiddled.

I have it *much* more reliable but *not* fixed.

Anyone help? Suggestions?

B. File system and FSCK blows up under load
-------------------------------------------

On my system (above), on which I run Samba, if I grab a really big
block of files from a Novell server and drop them on to the Linux
box (hmmm... heavy duty drag and drop copy via a Win95 work-station)
... typically after around 100-200MB of files I get the error(s)
in (A) above and eventually it all dies.... [in the limit I end
up with the kernel printing the message "idle task cannot sleep"
over and over].

Next time I boot the system and rc.sysinit runs fsck, it dives off
into checking /dev/sdc1 and /dev/sdd1 (the two big Micropolis drives)
as they were not 'clean' and dumps out with:

Cannot dereference kernel null pointer.... stack dump etc.
<- register dump ->

OR!!!

Cannot resolve lock from context...

I was allowing fsck to "use the parallelism in the hardware" by
checking both drives at the same time.

So, disable this (change fstab) and do them sequentially... first
drive checks out and is mountable... second drive is stuffed and still
causes fsck to dump out.
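
For reference, "fsck -A" takes its ordering from the sixth field of
/etc/fstab (the pass number): filesystems that share a pass number and
live on different drives may be checked in parallel, while distinct
numbers force them to run one after another. The sketch below groups
fstab entries by pass number to show what would be checked together.
The mount points /data1 and /data2 and the helper name `fsck_passes`
are invented for the example; only /dev/sdc1 and /dev/sdd1 come from
the setup described above.

```python
from collections import defaultdict


def fsck_passes(fstab_text):
    """Group devices by fstab pass number (sixth field).

    Entries with pass number 0, or with the field missing, are never
    checked by "fsck -A" and are left out of the result.
    """
    passes = defaultdict(list)
    for raw in fstab_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        passno = int(fields[5]) if len(fields) > 5 else 0
        if passno > 0:
            passes[passno].append(fields[0])
    return dict(sorted(passes.items()))


# Same pass number: both drives are candidates for a parallel check.
PARALLEL = """\
/dev/sdc1  /data1  ext2  defaults  1 2
/dev/sdd1  /data2  ext2  defaults  1 2
"""

# Different pass numbers: /dev/sdd1 is only checked after /dev/sdc1.
SEQUENTIAL = """\
/dev/sdc1  /data1  ext2  defaults  1 2
/dev/sdd1  /data2  ext2  defaults  1 3
"""

if __name__ == "__main__":
    print(fsck_passes(PARALLEL))
    print(fsck_passes(SEQUENTIAL))
```

Any pass number that maps to more than one device is a group fsck may
run concurrently, which is the case that was blowing up here.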

End up with a trashed file-system and no data! Argh! So, check
drive out okay with independent utils and badblocks - all seems fine.

Do mke2fs on /dev/sdd1 and put it all back on, and the same happens again
a week later - now REAL ARRGGGHHHH!!!! costing a fortune in admin time
and lost data.

Any suggestions or ideas?

Mike

--
Michael J Tubby  B.Sc.  G8TIC
Technical Director, Thorcom Systems Limited
Tel: 01 905 756700 (intl: +44 1 905 756 700)
Fax: 01 905 755777 (intl: +44 1 905 755 777)
Web: http://www.thorcom.com
Email: mike@thorcom.com