Re: scsi-problem (phase change ?)

Gerard Roudier (groudier@club-internet.fr)
Sun, 22 Jun 1997 11:47:11 +0200 (MET DST)


On Sun, 22 Jun 1997, Hauke Johannknecht wrote:

> i have a "evil" problem here ...
> it trashed some partitions 3 times now.
>
> i am using kernel pre-2.0.31-1 updated with ncr-1.18f ...
> scsi-host is a no-probs-till-now-NCR-810 with
> 5 devices attached.
>
> ID 0 -- IBM-DCRS 4.5 GB (new in the system, maybe the troublestarter)
> ID 1 -- Quantum LPS 105 (just dont ask ...)
> ID 3 -- Seagate ST1600N (OLD, but works ...)
> ID 5 -- Sanyo Quad-CDROM
> ID 6 -- HP 6020i (now take a guess why i keep the seagate ...)
>
> the ST got some heat-problems. but i
> keep it for "buffer"-usage, most times
> its powered down. so no prob.
>
> the system trashed data on the DCRS in the last two days.
> up to complete partition-corruption.
>
> only relevant comment in syslog was something like
>
> ncr53c810-0-<0,0>: phase change 2-7 6@00249c20 resid=2.
> ncr53c810-0-<0,0>: phase change 2-7 10@0024962c resid=4.

The SCSI controller saw a phase change from COMMAND phase to
MESSAGE_IN phase, with some residual bytes of the SCSI command not
accepted by the drive. If we exclude a problem in the DCRS drive itself,
the most probable cause is a bad signal level on the SCSI bus that
corrupted data or broke the SCSI protocol.
We should probably expect such problems to be recovered from; however,
error recovery is very hard to implement and to test and, in any case,
it is not possible, in my opinion, to recover from all kinds of errors.
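
For readers who want to decode such messages themselves, here is a
minimal sketch in C. It is not taken from the driver source. The phase
codes are the standard SCSI ones, built from the MSG, C/D and I/O
signals. The helper names (scsi_phase_name, parse_phase_change) are
hypothetical, and the reading of the other fields as "count@address"
followed by a residual byte count is an assumption based on the two
messages quoted above.

```c
#include <stdio.h>
#include <string.h>

/* SCSI information transfer phases: value = MSG<<2 | C/D<<1 | I/O. */
enum scsi_phase {
    PHASE_DATA_OUT = 0,
    PHASE_DATA_IN  = 1,
    PHASE_COMMAND  = 2,
    PHASE_STATUS   = 3,
    PHASE_MSG_OUT  = 6,
    PHASE_MSG_IN   = 7
};

/* Return a printable name for a 3-bit phase code (4 and 5 are reserved). */
const char *scsi_phase_name(int phase)
{
    switch (phase & 7) {
    case PHASE_DATA_OUT: return "DATA_OUT";
    case PHASE_DATA_IN:  return "DATA_IN";
    case PHASE_COMMAND:  return "COMMAND";
    case PHASE_STATUS:   return "STATUS";
    case PHASE_MSG_OUT:  return "MESSAGE_OUT";
    case PHASE_MSG_IN:   return "MESSAGE_IN";
    default:             return "RESERVED";
    }
}

/*
 * Hypothetical helper: pull the old phase, the new phase and the
 * residual count out of a syslog line such as
 *   "ncr53c810-0-<0,0>: phase change 2-7 6@00249c20 resid=2."
 * The field layout is assumed from the quoted messages only.
 * Returns 0 on success, -1 if the line does not match.
 */
int parse_phase_change(const char *line, int *from, int *to, int *resid)
{
    const char *p = strstr(line, "phase change ");

    if (p == NULL)
        return -1;
    /* The count and address fields are parsed but not stored. */
    if (sscanf(p, "phase change %d-%d %*d@%*x resid=%d",
               from, to, resid) != 3)
        return -1;
    return 0;
}
```

Fed the first message above, parse_phase_change() yields from=2, to=7
and resid=2, which scsi_phase_name() turns into COMMAND and MESSAGE_IN:
the drive left COMMAND phase with 2 bytes of the command still unsent.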

I think that mixing old and recent devices, or devices of very different
purpose and speed, on the same SCSI bus, or connecting too many
devices to the same SCSI bus, increases the probability of SCSI problems.

> seems to happen only if the system is running under
> heavy load AND the ST is powered up some time ...

Do you mean that you powered up the ST while the system was running?

> (can an overheated hdd data-kill another one via the scsi-bus ?)

Since the SCSI bus is a shared resource, any device on the bus can
make the resource unusable.

> questions now:
> - WHAT are these errors ?
My response is above.

> - WHY is it happening ?
You should send this question to Mr Murphy. :)

> - WHO is responsible ?
Us.
You, because your SCSI bus configuration looks like one that is quite
likely to run into problems, especially if you switch your ST on under
heavy load.
And me, if it is possible to recover from such errors.

> - HOW can i stop it ?
Trying to recover from such errors in the driver, if it is possible, would
perhaps cure the consequence but not repair the system if, as I think,
your SCSI system (all components sharing the resource) uses a mix that
makes SCSI problems much too probable.
It is better to try to fix the cause, in my opinion.

My recommendation is to use more than one SCSI bus and to distribute the
devices among the buses in a way that minimizes the risk of SCSI problems.
Two buses are generally enough for most systems.
Base the choice on the speed, purpose, age, quality, etc. of the SCSI
devices.
That cannot hurt, and it helps performance at least when you are using
two devices of very different speed at the same time.

As an example, here is my SCSI system description:

- NCR53C810 that drives an IBM S12 narrow fast SCSI-2 HD and a Toshiba
3401D SCSI-2 CD-ROM.

- NCR53C875 that drives an Atlas I Wide HD and an Atlas II Ultra Wide HD.

All that stuff with a BUS as short as possible and only active
terminations.

Gerard.