RE: A3940UW bus resets with zip drive

Doug Ledford (dledford@dialnet.net)
Sat, 17 Jan 1998 12:02:02 -0600 (CST)


On 17-Jan-98 Andreas Fredriksson (kernel account) wrote:
>combined Adaptec 3940UW / Creative Vibra16 on an Asus media bus.
                  ^^^^^^
>aic7xxx: <Adaptec AIC-7880 Ultra SCSI host adapter> at PCI 13
                       ^^^^

Actually, this is a 2940UW. The 3940UW controllers have two SCSI chips, a
PCI bridge chip, and two distinct SCSI busses (not just two connectors, one
wide and one narrow, but two totally different busses with only one
connector each, either wide or narrow depending on whether the 3940 card is
a wide or narrow card).

>aic7xxx: BIOS enabled, IO Port 0xd800, IO Mem 0xf9800000, IRQ 10, Revision B
>aic7xxx: Wide Channel, SCSI ID 7, 16/16 SCBs, QFull 16, QMask 0x1f
>scsi0 : Adaptec AHA274x/284x/294x (EISA/VLB/PCI-Fast SCSI) 4.1.1/3.2.1
>scsi : 1 host.
>scsi0: Scanning channel A for devices.
> Vendor: QUANTUM Model: FIREBALL ST3.2S Rev: 0F0C
          ^^^^^^^        ^^^^^^^^
Both the Fireball Tempest and the Fireball Stratus (yours) are known to be
somewhat problematic on the Adaptec controllers. Initial tests done by
Justin Gibbs under FreeBSD seem to indicate that part of the problem could
be related to SCSI ID ordering. More on that later in this email.

> Type: Direct-Access ANSI SCSI revision: 02
>Detected scsi disk sda at scsi0, channel 0, id 2, lun 0
> Vendor: IOMEGA Model: ZIP 100 Rev: J.03
> Type: Direct-Access ANSI SCSI revision: 02

>As you can see there are timeouts in both reading and writing, it seems
>the whole scsi bus is locked!

Yep, this is a symptom of what the Fireball drives sometimes do. Of course,
the driver should recover and continue on; the bus resets are annoying, but
they shouldn't be fatal. As for the kerneld message, I know nothing about
that one.

A second possibility is a termination issue. You can get more info out of
the driver by using the following option when you insmod the aic7xxx module:

insmod aic7xxx aic7xxx_verbose=2

My latest driver patch (not quite released yet) includes an update to the
README.aic7xxx file that explains all of the available options and how to
set them. For your kernel, the above should result in *tons* more
information, which you could then send to me so I can take a look at it :)
It would also be helpful to get a description of the internal SCSI bus you
have hooked up: detail how the cable runs from the controller to the first
connected device, from there to the next, and so on through to the last
device, and which of those devices have their termination enabled. The last
one should be the only one, but you have two devices commonly found to have
inferior termination, so if the hard disk isn't both the last device and the
terminating device, that could be part of the problem.
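
If it helps to write the cable run down systematically before sending it, a
minimal Python sketch like the following could check it against the rule
above. The `Device` type, the `termination_problems` function and the device
names are hypothetical, and the controller's own end of the bus is
deliberately not modeled.

```python
from typing import List, NamedTuple, Optional


class Device(NamedTuple):
    name: str
    terminated: bool  # True if this device's own termination is enabled


def termination_problems(chain: List[Device],
                         preferred_last: Optional[str] = None) -> List[str]:
    """Check a cable run listed in order from the controller outward.

    Applies the rule from the text: only the last device on the cable should
    have termination enabled. If preferred_last is given (e.g. the hard
    disk), also flag a chain where some other device is the terminator.
    The controller's own end of the bus is not modeled (an assumption).
    """
    problems: List[str] = []
    if not chain:
        return problems
    # Every device before the end of the cable should be unterminated.
    for dev in chain[:-1]:
        if dev.terminated:
            problems.append(f"{dev.name}: terminated but not the last device")
    # The device at the end of the cable should be terminated.
    last = chain[-1]
    if not last.terminated:
        problems.append(f"{last.name}: last device but not terminated")
    # Optionally prefer a specific device (the hard disk) as the terminator.
    if preferred_last is not None and last.name != preferred_last:
        problems.append(f"{preferred_last}: not the last, terminating device")
    return problems


# Controller -> FIREBALL -> ZIP, with the ZIP drive terminating the bus.
chain = [Device("FIREBALL", False), Device("ZIP", True)]
print(termination_problems(chain))  # []
print(termination_problems(chain, preferred_last="FIREBALL"))
```

The second call flags the layout described above as suspect, where the hard
disk is not the device terminating the bus.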

Now, if it isn't a termination issue, then it probably comes back to the
flaky Fireball drive. So, back to that issue. So far, we've seen that
these drives are somewhat flaky under heavy load. It doesn't have to be
heavy load on that device; it can be heavy load on other devices and only
minor load on the Fireball drive. The conditions are something like this.

The hardware on the Adaptec cards gives SCSI ID 7 the highest priority
during bus arbitration. The priority then goes down the list in descending
order from 7 to 0, then jumps to 15 and goes down to 8. This means that any
time two SCSI devices try to grab the SCSI bus at the same time, the device
with the higher priority always wins. Since your ZIP drive is SCSI ID 3, the
controller is 7, and the Fireball drive is 2, both the SCSI controller and
the ZIP drive have higher priority than the Quantum.
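
The arbitration order described above can be modeled in a few lines of
Python. This is only a sketch that encodes the 7-to-0, then 15-to-8 order
stated here; `PRIORITY_ORDER`, `priority_rank` and `arbitration_winner` are
illustrative names, not anything in the aic7xxx driver.

```python
# Arbitration order from the text (index 0 = highest priority):
# 7 down to 0, then 15 down to 8.
PRIORITY_ORDER = list(range(7, -1, -1)) + list(range(15, 7, -1))


def priority_rank(scsi_id: int) -> int:
    """Lower rank means higher arbitration priority (IDs 0-15 only)."""
    return PRIORITY_ORDER.index(scsi_id)


def arbitration_winner(contenders):
    """Return the SCSI ID that wins when all contenders arbitrate at once."""
    return min(contenders, key=priority_rank)


# IDs from this thread: controller 7, ZIP drive 3, Fireball 2.
print(arbitration_winner([2, 3]))     # 3: the ZIP drive beats the Fireball
print(arbitration_winner([2, 3, 7]))  # 7: the controller beats both
print(sorted([2, 3, 7, 12], key=priority_rank))  # [7, 3, 2, 12]
```

With these IDs, the Fireball at ID 2 loses whenever the controller or the ZIP
drive is arbitrating at the same moment.
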
One of the possible side effects of this is that during periods of heavy
activity on the ZIP drive, we could very well end up sending a command to
the ZIP drive, maybe even queueing up several if it supports tagged queueing
(I don't know, I don't have one), and generally keeping the bus busy. Then,
let's say we send a command to the Fireball. Over the next little while, if
the ZIP drive keeps reconnecting to the SCSI card to complete commands, the
SCSI card keeps sending fresh commands to the ZIP drive, and they both keep
beating out the Fireball drive in bus arbitration, then the Fireball drive
may eventually go comatose and wedge the SCSI bus. Our theory, based on SCSI
bus analyzer traces, is that the drive loses the bus arbitration phase one
too many times, exceeds some sort of counter in the firmware, and wedges,
requiring a bus reset to get things going again. However, unless I can get
hold of a firmware author at Quantum and get some direct answers out of him,
this theory can't be confirmed or denied.

We see this most commonly with CD-ROM drives, since they have an internal
read-ahead mechanism fairly well suited to their typical usage. For example,
let's say you start a copy from a CD-ROM to a hard disk. The CD-ROM will
stay ahead of your reads fairly well, so it isn't unexpected that almost
immediately after we send a read request, the CD-ROM will reconnect to start
the data transfer and then complete the command, and we can often get the
very next command out to the CD-ROM on the very next bus arbitration cycle
(plus there will be a series of disconnects followed by immediate reconnects
from the CD-ROM during the transfer). Each of these times, a hard drive may
be losing out on bus arbitration, and eventually it loses often enough to
make it do bad things.
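
To make the starvation theory concrete, here is a toy Python replay of
arbitration cycles. The SCSI IDs come from this message; the traffic
schedule is invented, and the per-request loss counter with its limit of 8
is purely hypothetical, since the theory itself is unconfirmed. `simulate`,
`rank` and `LOSS_LIMIT` are illustrative names only.

```python
from typing import List, Set, Tuple

# Arbitration order from the text: ID 7 highest, down to 0, then 15 down to 8.
PRIORITY_ORDER = list(range(7, -1, -1)) + list(range(15, 7, -1))

# HYPOTHETICAL: the real counter in the drive firmware (if there is one) and
# its limit are unknown; 8 is an arbitrary value for illustration.
LOSS_LIMIT = 8


def rank(scsi_id: int) -> int:
    """Lower rank means higher arbitration priority."""
    return PRIORITY_ORDER.index(scsi_id)


def simulate(schedule: List[Set[int]], victim: int,
             loss_limit: int = LOSS_LIMIT) -> Tuple[str, int]:
    """Replay arbitration cycles for a device with one pending request.

    schedule[i] is the set of *other* IDs arbitrating on cycle i. Returns
    ("won", i) when the victim gets the bus, ("wedged", i) when it has lost
    loss_limit times in a row, or ("pending", len(schedule)) otherwise.
    """
    losses = 0
    for cycle, others in enumerate(schedule):
        # The victim wins only if it outranks everyone else arbitrating
        # (trivially true on an idle cycle with no other contenders).
        if all(rank(victim) < rank(other) for other in others):
            return ("won", cycle)
        losses += 1
        if losses >= loss_limit:
            return ("wedged", cycle)
    return ("pending", len(schedule))


# The controller (7) keeps issuing commands and a busy ZIP/CD-ROM (3) keeps
# reconnecting, so one of them arbitrates on every cycle.
busy = [{7}, {3}, {7, 3}, {3}] * 3
print(simulate(busy, victim=2))  # ('wedged', 7)

# The same traffic with one idle cycle early on lets the disk at ID 2 through.
gap = busy[:4] + [set()] + busy[4:]
print(simulate(gap, victim=2))   # ('won', 4)
```

Nothing here models real bus timing; it only illustrates how a fixed
arbitration order plus a per-request loss counter could turn steady traffic
from higher-priority IDs into a wedged device.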

----------------------------------
E-Mail: Doug Ledford <dledford@dialnet.net>
Date: 17-Jan-98
Time: 12:02:06
----------------------------------