Re: Hard lockup with 2.0.34pre15

Doug Ledford (dledford@dialnet.net)
Tue, 19 May 1998 12:27:30 -0500


Ondrej Feela Filip wrote:
>
> On 19 May 1998, Jens Lautenbacher wrote:
>
> >
> > Just a little follow-up to that: As I was just told, there was another
> > problem yesterday that may be related to this:
> >
> > Suddenly bash stopped working (it segfaulted whenever it was
> > started), which was fixed by copying the bash binary from another
> > machine...
> >
> > We had this problem with rshd and tcsh, too, in the past. Those
> > binaries suddenly became corrupted (e.g. the tcsh binary from that
> > machine differed in 16 consecutive bytes from the tcsh of another
> > machine -- and yes, both machines had the same distribution, redhat 5.0,
> > if that matters)
>
> I have the same problem! My machine (2.0.34pre15, Adaptec 2940U, 3c905, and
> an SDL card with the proprietary SDL Frame Relay driver :-(( ) sometimes
> hangs and sometimes corrupts some important file. Last time it was bash and
> libc. :-(

These descriptions are reminiscent of two separate problems. First, the
aic7xxx driver that ships by default in the RedHat-5.0 distribution (as well
as the aic7xxx driver in 2.0.33) has a 16 byte memory scribble.
Theoretically, this shouldn't cause disk corruption on static files, because
even if we scribbled on memory, we wouldn't be writing it back out. The
scribble in question only applied to in-memory copies of programs or code.
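
For anyone who wants to check whether a suspect binary shows the same kind of
damage Jens described (a short run of consecutive bad bytes), here is a
minimal sketch built on cmp -l. The function name and the example paths are
made up for illustration; the only assumption is that you have a known-good
copy of the same binary from another machine.

```shell
#!/bin/sh
# Hypothetical helper: report how many bytes differ between two copies of a
# file and the offset range they span. cmp -l prints one line per differing
# byte: the 1-based offset, then the two byte values in octal.
byte_diff_summary() {
    if [ ! -r "$1" ] || [ ! -r "$2" ]; then
        echo "cannot read input files" >&2
        return 2
    fi
    cmp -l "$1" "$2" 2>/dev/null | awk '
        NR == 1 { first = $1 }
        { last = $1; n++ }
        END {
            if (n == 0) print "identical"
            else printf "%d bytes differ, offsets %d-%d\n", n, first, last
        }'
}

# Example (placeholder paths): compare the local tcsh with a copy taken
# from a known-good machine.
# byte_diff_summary /bin/tcsh /mnt/good/bin/tcsh
```

A small count with a tight offset range looks like a scribble. Differences
scattered across the whole file more likely mean the two binaries simply are
not the same build. Note that the sketch only counts bytes within the common
length of the two files.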

Secondly, if this is happening with the later drivers, as it evidently has
with the aic7xxx driver in 2.0.34pre15, then I'm very suspicious of the
hardware. To put it bluntly, the newest aic7xxx driver increases the DMA
load on your system for the same number of commands. This increased DMA
load can be enough to make hardware glitches show up in marginal systems.
I'm not guessing here; I have a machine that I can prove it with. So, here
is a simple test to see whether this is what's happening:

Get linux-2.0.33.tar.gz from ftp.kernel.org and put it in the /usr/src
directory. Move your current linux source tree to another directory name
(such as linux.real). Then run this script:

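#!/bin/sh

cd /usr/src || exit 1

# Unpack one reference copy of the tree to compare against.
tar xzf linux-2.0.33.tar.gz
mv linux linux.orig

# Unpack fresh copies forever; any diff output means a copy came out wrong.
while true
do
    tar xzf linux-2.0.33.tar.gz
    diff -U 3 -rN linux.orig linux
    rm -fr linux
done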

----------End script------------

That script will run until it is killed. You should never see any output
from it. If you do, you're getting hit by a hardware glitch and need to
track down the source (bad RAM, bad cache, bad CPU, whatever). Try changing
the various BIOS timing settings for RAM and cache, or disabling the cache,
until the output goes away. Once it does, your machine should be stable.
Don't forget to check things like CPU fans and PCI options as well if you
get any errors from this script.
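
If you'd rather leave the test unattended, a variant of the same loop can
stop at the first mismatch and keep the diff for inspection. This is only a
sketch: the function name, its three arguments, and the mismatch.N.diff file
names are my own illustration, and it assumes the tarball unpacks into a
top-level directory called linux, as linux-2.0.33.tar.gz does.

```shell
#!/bin/sh
# Sketch: repeat the unpack-and-compare test, stopping at the first mismatch.
# Arguments (illustrative names):
#   $1 absolute path to the tarball
#   $2 scratch directory to unpack in
#   $3 maximum number of passes (0 or omitted = run until killed)
# The body runs in a subshell so the caller's working directory is untouched.
unpack_compare() (
    tarball=$1
    work=$2
    max=${3:-0}
    pass=0
    cd "$work" || exit 2
    rm -rf linux linux.orig
    tar xzf "$tarball" || exit 2
    mv linux linux.orig || exit 2
    while [ "$max" -eq 0 ] || [ "$pass" -lt "$max" ]
    do
        pass=$((pass + 1))
        tar xzf "$tarball" || exit 2
        if ! diff -U 3 -rN linux.orig linux > "mismatch.$pass.diff"
        then
            echo "pass $pass: MISMATCH, see $work/mismatch.$pass.diff"
            exit 1
        fi
        rm -f "mismatch.$pass.diff"
        rm -rf linux
        echo "pass $pass: clean"
    done
    exit 0
)

# Example: run until killed or until the first mismatch.
# unpack_compare /usr/src/linux-2.0.33.tar.gz /usr/src
```

On a mismatch the bad tree is left in place next to linux.orig, so you can
look at exactly which files came out wrong before changing any BIOS settings.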

Let me know what you find out, because if there is a bug involving some
sort of scribble, I definitely want to find it. However, do me a favor
and test against the aic7xxx-5.0.15 driver. There is a patch that updates
the 2.0.34pre15 kernel to the 5.0.15 driver at
ftp://ftp.dialnet.net/pub/linux/aic7xxx/2.0.34pre15/aic7xxx-5.0.15-34pre15.patch.gz

-- 

Doug Ledford <dledford@dialnet.net> Opinions expressed are my own, but they should be everybody's.
