Odd, odd problem with identical boards not being identical.

From: Chris Meadors (clubneon@hereintown.net)
Date: Wed Aug 09 2000 - 13:17:43 EST


Over the last 2 weeks I have been working my way through problems I've
been having with this new server I'm assembling. Much thanks to both l-k
and l-s for even getting me to thise point.

Short, short rundown of the events up to this point:

I have 2 motherboards (both the same). They are connected to 2 channels
of a 3 channel externel RAID controller. The 3rd channel is connected to
the disk array.

I was having trouble with 1 of the boards oopsing all the time (the other
board did seem to be more stable, but I didn't do much testing on it
because I don't have a clustering file system installed yet so it was just
mounting everything readonly (I do have unique root and /var filesystems
for each motherboard)).

I replaced the current driver for my the SCSI controller (SYM53C896) on
these boards with the most recent version from Gerard's ftp site. This
driver works much better. It no longer oops the machine no matter what I
try to shovel over to the drives. (Gerard count this as a "works for
me" for this driver).

But...

At the point where the older driver would oops this one starts spewing
errors. This happens with the cache on the RAID controller fills up (that
is why when I added more RAM to the controller it was harder to recreate
the problem).

Basicly the errors look like:

sym53c896-0:4: message d sent on bad reselection.
sym53c896-0-<4,0>: extraneous data discarded.
sym53c896-0-<4,0>: COMMAND FAILED (89 0) @f75c1800.
sym53c896-0-<4,0>: ordered tag forced.
scsi : aborting command due to timeout : pid 0, scsi0, channel 0, id 4,
lun 0 Write (10) 00 00 08 17 78 00 00 10 00
sym53c8xx_abort: pid=0 serial_number=87234 serial_number_at_timeout=87234

They just keep coming until the process accessing the disk hangs in a 'D'
state.

Okay, so I'd say there is something wrong with my RAID controller. Well I
totally corrupted the root filesystem of the motherboard where I was doing
all my testing, so I moved over to the second one. Set it so it would
have r/w access to the disks. And continued testing. On this board I
could not recreate one problem at all. It is rock solid.

I have spent today compairing the settings for both motherboards, SCSI
controllers, and the channels on the RAID controller. Everything matches
exactly. So I swapped cables to the malfunctioning board was using the
cable and channel on the RAID controller that the proper functioning board
had. The problem stuck with the board. These motherboards have dual SCSI
channels. I tried the other channel, the problem was still there.

Does anyone have a clue as to what could cause this? It is like one
mother board's controller can sence that the queue is full and backs off
with its writing, while the other one can't, and just keeps going full
speed into the wall.

Again, I can't say how thankful I am to both lists for helping me thus
far,
Chris Meadors

-- 
Two penguins were walking on an iceburg.  The first one said to the
second, "you look like you are wearing a tuxedo."  The second one said,
"I might be..."
                                              --David Lynch, Twin Peaks

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Tue Aug 15 2000 - 21:00:19 EST