I am posting to these two groups as the solution is somewhere in between :-)
This problem has been going on for a long time. I sort of learned to live
with it as no one seemed to be able/interested in helping me solve it.
1. Stage: 1.3.71, 64MB RAM, eata_dma and/or Buslogic, etc.
2. Action: cd /some_big_filesystem;
find . -print | cpio -dmpv /another_big_partition
You can replace this with tar | tar, etc.
3. Result: SCSI bus reset in a loop due to timeout on one
disk or another (typically the same disk).
How reproducible is this supposed to be? I built a vanilla 1.3.71 kernel,
set up one 2.1GB disk as /x and another as /y, populated /x with 377MB of X11
source files, and then executed the above find | cpio (without the -v) 10 times
in a row copying from /x to /y and then deleting the files on /y. No problems
whatsoever.
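
For anyone who wants to repeat that test, the loop could be scripted roughly as
below. This is only a sketch: the function name copy_stress is made up, both
directories must be given as absolute paths, and it falls back to tar | tar
when cpio is not installed (the original report says either pipeline triggers
the problem).

```shell
#!/bin/sh
# Hypothetical helper: copy SRC to DST a number of times, emptying DST
# after each pass. SRC and DST must be absolute paths to existing directories.
copy_stress() {
    src=$1; dst=$2; passes=${3:-10}
    if [ ! -d "$src" ] || [ ! -d "$dst" ]; then
        echo "usage: copy_stress SRC DST [PASSES]" >&2
        return 2
    fi
    i=1
    while [ "$i" -le "$passes" ]; do
        if command -v cpio >/dev/null 2>&1; then
            # Same pipeline as in the report, minus -v.
            (cd "$src" && find . -print | cpio -dmp "$dst") || return 1
        else
            # Assumed fallback: the tar | tar variant.
            (cd "$src" && tar cf - .) | (cd "$dst" && tar xf -) || return 1
        fi
        echo "pass $i: copy finished"
        # Remove everything below DST but keep DST itself (the mount point).
        find "$dst" -mindepth 1 -delete || return 1
        i=$((i + 1))
    done
}
```

Called as "copy_stress /x /y 10" it mirrors the ten passes from /x to /y
described above. It stops at the first failed copy, though a hung SCSI bus
would more likely show up as the loop never printing the next pass.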
These are NOT bugs in the eata_dma driver, nor the BusLogic driver (unless
they are both bad exactly the same way - hard to swallow).
The bus reset comes from a layer above the HBA. Different HBAs react
differently but the result is the same:
FAST SCSI I/O ON LINUX IS IMPOSSIBLE WITHOUT CRASHES.
LARGE BLOCK I/O ON LINUX IS IMPOSSIBLE WITHOUT CRASHES.
Well, actually I believe there are presently problems with abort and reset
handling that are inherent in the current interfaces, and so there is no way
the drivers can work correctly. I described one of these problems last week
but I haven't seen any discussion so perhaps that message was swallowed
somewhere. More on these issues in a subsequent message.
We had this problem for a long time. I posted it several times. I can
repeat and reproduce it any time.
Do you have any idea if 1.2.13 suffers from this problem as well, or is it only
the 1.3.x kernels that do? Can you give me better guidance on setting up a
test environment to reproduce this?
The problem cannot be reproduced by random seeking, dd or any other trivial
(but useless) method. It can only be reproduced by doing the type of fast
copy I am describing above. A clue can probably be found in the fact that
backup to tape never crashes, rs (which does random read) never crashes,
dd to /dev/null never crashes. It always crashes on the WRITE side,
seemingly to the same drive.
The error is always an infinite loop of
``SCSI: resetting host scsi[01] due to target n''
Does it look something like this:
<6>SCSI host 0 abort (pid 395210) timed out - resetting
<6>scsi0: Resetting BusLogic BT-958 due to Target 0
<6>scsi0: *** BusLogic BT-958 Initialized Successfully ***
<6>SCSI host 0 abort (pid 395560) timed out - resetting
<6>scsi0: Resetting BusLogic BT-958 due to Target 0
<6>scsi0: *** BusLogic BT-958 Initialized Successfully ***
<6>SCSI host 0 abort (pid 395210) timed out - resetting
<6>scsi0: Resetting BusLogic BT-958 due to Target 0
<6>scsi0: *** BusLogic BT-958 Initialized Successfully ***
<6>SCSI host 0 abort (pid 395560) timed out - resetting
<6>scsi0: Resetting BusLogic BT-958 due to Target 0
<6>scsi0: *** BusLogic BT-958 Initialized Successfully ***
but with no recovery actually occurring? Unfortunately, I cannot generate
these problems on demand as you appear to be able to.
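
One rough way to tell such a loop from isolated recoveries is to check whether
the same command pids keep coming back in the abort messages, as 395210 and
395560 do in the excerpt above. A small sketch, assuming the messages have
exactly the format shown and arrive on standard input; the function name
repeated_aborts is made up.

```shell
# Hypothetical helper: print each command pid that appears in more than one
# "abort (pid N) timed out" message, followed by how often it appeared.
repeated_aborts() {
    sed -n 's/.*abort (pid \([0-9][0-9]*\)) timed out.*/\1/p' |
        sort | uniq -c | awk '$1 > 1 { print $2, $1 }'
}
```

For example, "dmesg | repeated_aborts" would list the pids that were aborted
more than once. Empty output only means no pid was repeated in the messages
still held in the kernel buffer.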
It is always as a result of a ``timeout''. There is no way to kill it,
sync never returns, umount never returns, df never returns. Therefore
shutdown never completes.
The errors I've been looking into all start with timeouts as well, but that's
most likely because the abort/reset error recovery mechanisms are where the
problems are, and those are only exercised when timeouts occur (well, resets
also happen when commands fail).
Leonard