sg SCSI bus timeout problem (2.3.99pre...)

From: Michael H. Warfield (mhw@wittsend.com)
Date: Sun May 21 2000 - 14:23:38 EST


Hello all,

        I'm running into a problem with SCSI bus timeout problems in
2.3.99. Specifically it's 2.3.99pre9pre3 that I'm working with right now.

        First the good news... Up until recently, the 2.3 kernels have
been in serious hurt cutting CD's with cdrecord on a pure SCSI system.
The 2 Meg shared memory fifo would drain from 100% to next to nothing
and then you would hold your breath hoping it wouldn't underrun. With the
2.3.99pre9preX the fifo is running right tight to 100% and I've seen
no underruns burning the data.

        Where I am having problems is closing out the session. The final
close out command is giving me an error 30 seconds after initiating
with this error being returned from cdrecord:

] cdrecord: Input/output error. close track/session: scsi sendcmd: retryable error
] CDB: 5B 00 02 00 00 00 00 00 00 00
] status: 0x2 (CHECK CONDITION)
] Sense Bytes: 70 00 06 00 00 00 00 0A 00 00 00 00 29 00 00 00
] Sense Key: 0x6 Unit Attention, Segment 0
] Sense Code: 0x29 Qual 0x00 (power on, reset, or bus device reset occurred) Fru 0x0
] Sense flags: Blk 0 (not valid)
] cmd finished after 30.393s timeout 480s

        Now... That last line is saying it died 30 seconds after issuing
the command and that it had set the timeout to 480 seconds.

        Here's what I'm getting down in syslog, however:

] May 21 14:35:36 alcove kernel: scsi : aborting command due to timeout : pid 0, scsi0, channel 0, id 0, lun 0 Read (10) 00 00 04 3c 51 00 00 20 00
] May 21 14:35:36 alcove kernel: scsi : aborting command due to timeout : pid 0, scsi0, channel 0, id 0, lun 0 Read (10) 00 00 04 3c 79 00 00 58 00
] May 21 14:35:37 alcove kernel: SCSI host 0 abort (pid 0) timed out - resetting
] May 21 14:35:37 alcove kernel: SCSI bus is being reset for host 0 channel 0.
] May 21 14:35:40 alcove kernel: (scsi0:0:0:0) Synchronous at 5.0 Mbyte/sec, offset 15.
] May 21 14:35:40 alcove kernel: (scsi0:0:6:0) Using asynchronous transfers.
] May 21 14:35:41 alcove kernel: (scsi0:0:4:0) Synchronous at 10.0 Mbyte/sec, offset 15.
] May 21 14:35:45 alcove kernel: (scsi0:0:3:0) Synchronous at 10.0 Mbyte/sec, offset 15.
] May 21 14:35:45 alcove kernel: (scsi0:0:1:0) Synchronous at 5.0 Mbyte/sec, offset 15.

        Oh oh!!! The kernel timed out the command and reset the SCSI bus.
That basically agrees with the error message returned by cdrecord. But
the timeout was suppose to have been 480 seconds, not 30 seconds! That's
off by a factor of 16! WTF???

        I saw in sg.c that the timeout value is suppose to be in jiffies.
cdrecord is calculating a timeout value and using SG_SET_TIMEOUT to
set the command timeout to the calculated value. That code in cdrecord
is this:

] libscg/scsi-linux-sg.c

] LOCAL void
] sg_settimeout(f, tmo)
] int f;
] int tmo;
] {
] tmo *= HZ;
] if (tmo)
] tmo += HZ/2;
]
] if (ioctl(f, SG_SET_TIMEOUT, &tmo) < 0)
] comerr("Cannot set SG_SET_TIMEOUT.\n");
] }

        The code in sg_ioctl just stuffs that tmo value into spf->timeout
which supposedly defaults to 60 * HZ (SG_DEFAULT_TIMEOUT).

        Best I can figure here is that the timeout is either not getting
set to the correct value or is not getting honored correctly.

        Next "problem" of note, and this may be either "that's the way
it's suppose to operate" or "there's nothing we can do about it". During
a record, I notice certain periods where it seems like the entire SCSI
bus is locked out. In particular, during an erase (a full erase can take
as long as 80 minutes at 1x speed) or during session close / fixation,
everything seems to come to a screeeeetching halt while those long duration
commands are in progress. I thought the SCSI bus was suppose to be
multi-threaded. While the burner is closing out a session, I can not
get another block to move on that bus. Is there a reason for this?
Would a dedicated SCSI controller for the CD-RW burner help? (God, I might
as well be using IDE then!) I've only gone to pure SCSI bus systems
fairly recently, so this may have always been true, I just hadn't noticed
it until the last year or so...

        My system, where I'm having the problems right now, is an 350 MHz
AMD with 128Meg Ram and several multi Gig SCSI drives.

/proc/cpuinfo:

processor : 0
vendor_id : AuthenticAMD
cpu family : 5
model : 8
model name : AMD-K6(tm) 3D processor
stepping : 12
cpu MHz : 350.791041
cache size : 64 KB
fdiv_bug : no
hlt_bug : no
sep_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr mce cx8 sep mtrr pge mmx 3dnow
bogomips : 699.60

        The CD-RW burner is a Yamaha 4x4x16 SCSI burner.

        The SCSI Controller is an Adaptec AHA-294X Ultra SCSI host adapter.

        I'm going to try and boost those timeouts way way up to see if
it's just a muliplier screwed up somewhere... I don't recall having this
timeout on the session close occuring under the 2.2.x kernels, however.

        Mike

-- 
 Michael H. Warfield    |  (770) 985-6132   |  mhw@WittsEnd.com
  (The Mad Wizard)      |  (770) 331-2437   |  http://www.wittsend.com/mhw/
  NIC whois:  MHW9      |  An optimist believes we live in the best of all
 PGP Key: 0xDF1DD471    |  possible worlds.  A pessimist is sure of it!

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Tue May 23 2000 - 21:00:20 EST