Patches for SCSI timeout bug

From: linas@austin.ibm.com
Date: Wed Jun 04 2003 - 16:34:16 EST


Hi,

I've got a SCSI timeout bug in kernels 2.4 and 2.5, and several
different patches (appended) that fix it. I'm not sure which way
of fixing it is best.

Hardware:
IDE DVD/CDROM connected to ACHIP ARC765 based SCSI-to-IDE converter,
attached to symbios controller using sym53c8xx driver.

SYMPTOMS:
When booting, the system hangs because the initial SCSI bus scan times
out when it gets to this device. The timeout causes a command abort,
which also times out, and from there (in kernel 2.4) the kernel enters
an infinite loop of resets and timeouts. In kernel 2.5, it's not an
infinite loop, only two resets, but the device is never found.

ROOT CAUSE:
During boot, the sym53c8xx driver performs a SCSI bus reset.
The Achip takes about 15 seconds after a bus reset before it
is willing to reply to SCSI commands. However, in the current
code for the initial bus scan, a device is given 6 seconds
before scsi target aborts, resets, etc. come raining down.

I've got some lengthy SCSI bus traces if anyone cares to look.

MORE DETAILS:
During boot, the sym53c8xx driver for the SCSI controller
performs a SCSI bus reset. (Other drivers may or may not
perform this reset; some are configurable). After the
reset, it waits 2 seconds before starting a bus scan.
(Some other drivers wait 5, others 10; others maybe
more or less). During the bus scan, generic (common
among all drivers) SCSI code gives each device 6 seconds
to respond. If the device doesn't respond, the code
launches into a sequence of target aborts, bus resets,
etc. in an attempt to recover.

If the DVD/CDROM is scanned early in the bus scan, then
it will not have had time to finish resetting itself before
it's scanned, and it won't respond fast enough, leading to the
bad behaviour. If the machine has lots of disks, then the
CDROM is scanned later, giving it enough time, and then everything
is fine.
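
To make the timing argument concrete, here is a rough sketch in C. It is
a simplified model, not kernel code. The struct, the function names and
the per_device_secs parameter are all made up for illustration; the post
does not say how long each earlier device takes to scan. The model also
assumes that a command issued before the device has recovered is still
answered once the device recovers, provided the timeout has not expired.

```c
#include <assert.h>
#include <stdbool.h>

/* All values are in seconds. Simplified model, not kernel code. */
struct scan_model {
    int settle_secs;      /* driver's post-reset delay (2 for sym53c8xx) */
    int inquiry_timeout;  /* generic per-device scan timeout (6 today) */
    int recovery_secs;    /* time the device needs after a bus reset */
    int per_device_secs;  /* hypothetical: time spent on each earlier device */
};

/* Seconds after the bus reset at which scan slot `slot` (0-based) is probed. */
int probe_start(const struct scan_model *m, int slot)
{
    assert(slot >= 0 && m->per_device_secs >= 0);
    return m->settle_secs + slot * m->per_device_secs;
}

/* True if the device has recovered before its scan timeout expires. */
bool answers_in_time(const struct scan_model *m, int slot)
{
    return probe_start(m, slot) + m->inquiry_timeout >= m->recovery_secs;
}
```

With settle_secs = 2, inquiry_timeout = 6 and recovery_secs = 15, a
device in slot 0 is given up on 8 seconds after the reset, well before
it has recovered. It is only safe once the earlier devices have used up
at least 7 seconds of scan time.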

FIXES:
There are two ways to fix this:
1) By increasing the generic SCSI bus scan timeout to be
   longer than 15 seconds (as well as the timeout for a
   bus reset to be longer than this).

2) By increasing the sym53c8xx post-reset delay to at least
   12 seconds.

Fix 2) may not be bad: I have at least one SCSI hard drive which
takes 5 seconds to recover from a bus reset. On the other hand,
fix 2) makes the boot process longer: it introduces a delay of
N x 12 seconds, where N is the number of SCSI channels.
(Most cards have two channels; some server-class machines with
many cards may have a significantly longer boot).

Fix 1) does not introduce any delay at all, if the SCSI
devices respond quickly. Fix 1) also will stop the problem
from recurring if/when this CDROM is attached to something
other than a sym53c8xx.
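
The trade-off between the two fixes can be sketched as a bit of
arithmetic in C. The function names are invented for illustration and
the model is deliberately crude: fix 2 always pays the settle delay once
per channel, while with fix 1 a scan waits only as long as the device
takes to answer, capped at the timeout.

```c
/* Fix 2: the settle delay is paid once per SCSI channel, on every boot. */
int fix2_boot_delay_secs(int channels, int settle_secs)
{
    return channels * settle_secs;
}

/* Fix 1: the wait per device is its response time, capped at the timeout. */
int fix1_wait_secs(int response_secs, int timeout_secs)
{
    return response_secs < timeout_secs ? response_secs : timeout_secs;
}
```

For a two-channel card and a 12 second settle time, fix 2 costs 24
seconds on every boot. With fix 1, a device that answers immediately
costs nothing, and only a slow device such as the Achip costs its actual
recovery time.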

I like fix 1) better, but I'm not a Linux SCSI guy, so I don't
really know & can't make this choice .... Below are some
patches for kernel 2.4; they are almost identical for kernel 2.5.

--linas

PATCHES for 'fix 1': (note these also fix a compile-time warning
in this code):

Index: scsi_scan.c
===================================================================
RCS file: /cvs/linuxppc64/linuxppc64_2_4/drivers/scsi/scsi_scan.c,v
retrieving revision 1.19
diff -u -r1.19 scsi_scan.c
--- scsi_scan.c 8 Jan 2003 18:47:06 -0000 1.19
+++ scsi_scan.c 29 May 2003 23:02:29 -0000
@@ -576,9 +576,10 @@
        SRpnt->sr_cmd_len = 0;
        SRpnt->sr_data_direction = SCSI_DATA_READ;
  
+ /* Some AChip ARC765 devices take 15 seconds to recover from bus reset */
        scsi_wait_req (SRpnt, (void *) scsi_cmd,
                  (void *) scsi_result,
- 256, SCSI_TIMEOUT+4*HZ, 3);
+ 256, SCSI_TIMEOUT+15*HZ, 3);
  
        SCSI_LOG_SCAN_BUS(3, printk("scsi: INQUIRY %s with code 0x%x\n",
                SRpnt->sr_result ? "failed" : "successful", SRpnt->sr_result));

Index: scsi_obsolete.c
===================================================================
RCS file: /cvs/linuxppc64/linuxppc64_2_4/drivers/scsi/scsi_obsolete.c,v
retrieving revision 1.4
diff -u -r1.4 scsi_obsolete.c
--- scsi_obsolete.c 22 Apr 2002 15:33:14 -0000 1.4
+++ scsi_obsolete.c 29 May 2003 23:02:29 -0000
@@ -106,21 +106,15 @@
 static void scsi_dump_status(void);
 #endif
  
-
-#ifdef DEBUG
-#define SCSI_TIMEOUT (5*HZ)
-#else
-#define SCSI_TIMEOUT (2*HZ)
-#endif
-
+/* same timeouts as scsi_error.c */
 #ifdef DEBUG
 #define SENSE_TIMEOUT SCSI_TIMEOUT
 #define ABORT_TIMEOUT SCSI_TIMEOUT
 #define RESET_TIMEOUT SCSI_TIMEOUT
 #else
-#define SENSE_TIMEOUT (5*HZ/10)
-#define RESET_TIMEOUT (5*HZ/10)
-#define ABORT_TIMEOUT (5*HZ/10)
+#define SENSE_TIMEOUT (10*HZ)
+#define RESET_TIMEOUT (2*HZ)
+#define ABORT_TIMEOUT (15*HZ)
 #endif
  
  

PATCH for 'fix 2'

Index: sym53c8xx_defs.h
===================================================================
RCS file: /cvs/linuxppc64/linuxppc64_2_4/drivers/scsi/sym53c8xx_defs.h,v
retrieving revision 1.8
diff -u -r1.8 sym53c8xx_defs.h
--- sym53c8xx_defs.h 22 Apr 2002 15:33:14 -0000 1.8
+++ sym53c8xx_defs.h 4 Jun 2003 21:24:49 -0000
@@ -269,7 +269,7 @@
 /*
  * Settle time after reset at boot-up
  */
-#define SCSI_NCR_SETUP_SETTLE_TIME (2)
+#define SCSI_NCR_SETUP_SETTLE_TIME (15)
  
 /*
 ** Bridge quirks work-around option defaulted to 1.



