Re: Oops in 2.3.X, 2.2.X, and 2.0.3X with FENRIS and NCR53c875 SCSI Driver

Jeff Merkey (jmerkey@timpanogas.com)
Thu, 10 Jun 1999 17:09:12 -0600


Gerard,

More info. The address on the analyzer trace appears to be a DSP register.
What looks like happened is that someone wrote some state into memorty then
hit the DSP register to initiate a SCSI SCRIPTs operation, then the driver
spun off into oblivion. This box we have a very new, and a lot of stuff is
broken on it besides this. It's a very new PIII system, and We have
problems with some of the hardware (Intel EtherExpress cards don't work
properly -- bus timing problems). There may be a config conflict that's
messing with your driver. I will look at it some more. This may be a case
where Intel shipped buggy chipsets and we had the misfortune of getting one
of these boxes.

Is this enough detail for you?

Jeff

----- Original Message -----
From: Jeff Merkey <jmerkey@timpanogas.com>
To: Jeff Merkey <jmerkey@timpanogas.com>; Gerard Roudier
<groudier@club-internet.fr>; <linux-kernel@vger.rutgers.edu>
Sent: Thursday, June 10, 1999 4:55 PM
Subject: Re: Oops in 2.3.X, 2.2.X, and 2.0.3X with FENRIS and NCR53c875 SCSI
Driver

>
> Gerard,
>
> I was not spending a lot of time with this right now becuase I was trying
to
> get FENRIS tested on as much hardware as possible prior to dumping the
code
> to our FTP server, but I will have some time tommorow to look at this in
> more depth if this would be helpful after I put the tarball up on our ftp
> server. What do you want me to look at next?
>
> Jeff
>
>
> ----- Original Message -----
> From: Jeff Merkey <jmerkey@timpanogas.com>
> To: Gerard Roudier <groudier@club-internet.fr>;
> <linux-kernel@vger.rutgers.edu>
> Sent: Thursday, June 10, 1999 4:51 PM
> Subject: Re: Oops in 2.3.X, 2.2.X, and 2.0.3X with FENRIS and NCR53c875
SCSI
> Driver
>
>
> >
> > Gerard,
> >
> > I am using an American Arium Logic analyzer (leased). The bus trace
shows
> > an errant DMA from a busmastering device that trashed memory in my
system
> (I
> > was running a bus trace to track this down). The bus activity indicates
> > that a device mapped to a very odd address in the system memory map
> > (0xB0000000) was doing something when the DMA occurred. I'm an old
> hardware
> > guy who converted to software, but I'm very proficient with hardware
> > analysis tools, in fact, most of my SMP debugging is done with a logic
> > analyzer with an inverse assembler. I also wrote SCSI scripts years
back
> > for the NCR 53C720 (we used SCSI for clustering) so I'm pretty familiar
> with
> > "buggy" SCSI scripts and I've seen this type of behavior before. There
> was
> > also a reset on the SCSI device which would indicate a reselction was in
> > process (saw a test unit ready and inquiry command with a status of '4'
> > being returned from the device if this is helpful.[also have a SCSI
> > analyzer(leased)]) .
> >
> > I have not looked through your scripts for the driver and don't have a
> > scripts compiler (does one ship with Linux)-- but could. Should I do
this
> > next?
> >
> > Jeff
> >
> > ----- Original Message -----
> > From: Gerard Roudier <groudier@club-internet.fr>
> > To: Jeff Merkey <jmerkey@timpanogas.com>
> > Sent: Thursday, June 10, 1999 3:34 PM
> > Subject: Re: Oops in 2.3.X, 2.2.X, and 2.0.3X with FENRIS and NCR53c875
> SCSI
> > Driver
> >
> >
> > >
> > >
> > > On Thu, 10 Jun 1999, Jeff Merkey wrote:
> > >
> > > > Alan,
> > > >
> > > > None of the other setups produce the "..could net get a free page.."
> > > > message. This looks like a subtle race condition with this
particular
> > scsi
> > > > driver. I will try some other scenarios.
> > >
> > > Dear Jeff,
> > >
> > > The first thing you should want to do prior to posting any diagnostic
of
> > > yours to the kernel list could be to, at least, check that you are not
> > > just annoucing some triviality or claiming some stupidity. Note that I
> > > didn't see any triviality in your postings.
> > >
> > > The second thing it to provide maintainers with relevant traces, logs,
> > > source code, scripts that triggers the problem, accurate description
of
> > > the problem, etc ... Your opinion may be interesting, but only if is
> based
> > > on relevant observations, and, by the way, something that is just
> claimed
> > > as "looking like subtle ..." has nothing to do with relevance, in my
> > > opinion.
> > >
> > > Now, if you really are quite sure that the SCSI driver is the thing
that
> > > breaks your system, I invite you to take biggest possible gun and and
> shot
> > > it immediately.
> > >
> > > I am the maintainer of some driver that you may have used on your
> system.
> > > Most of reports from people that experience problems when using such
> > > drivers generally accused the driver first. The reality _is_ that most
> > > of the time, the problem has been proved to be elsewhere.
> > >
> > > A kernel list is public. When you claim 'I think something is broken'
> > > in public, then, a not negligible number of readers may just beleive
you
> > > or just want to beleive you.
> > >
> > > Report the problem as you observe it, please, and avoid comments that
> > > are just bare claims. Thanks.
> > >
> > > Btw, the URL that hosts driver updates is the following:
> > > ftp://ftp.tux.org/pub/roudier/drivers/
> > >
> > > Gérard.
> > >
> > > >
> > > > Jeff
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: Alan Cox <alan@lxorguk.ukuu.org.uk>
> > > > To: Jeff Merkey <jmerkey@timpanogas.com>
> > > > Cc: <jmerkey@timpanogas.com>; <linux-kernel@vger.rutgers.edu>
> > > > Sent: Wednesday, June 09, 1999 5:32 PM
> > > > Subject: Re: Oops in 2.3.X, 2.2.X, and 2.0.3X with FENRIS and
> NCR53c875
> > SCSI
> > > > Driver
> > > >
> > > >
> > > > > > Also shows up single processor (different machine, same chipset
> > 53C875).
> > > > =
> > > > > > Looks like a possible SCSI Scipts error. FENRIS, unlike UNIX
> FS's
> > will
> > > > =
> > > > > > talk to more than one disk or partition concurrently since a
> volume
> > can
> > > > =
> > > > > > have multiple segments strip[ed across several drives. May be
> > exposing
> > > > =
> > > > > > some timing condition with the way I am calling breada abd
bread.
> > > > >
> > > > > The raid code already does that kind of concurrency. Now the
> important
> > > > clue
> > > > > is probably
> > > > >
> > > > > > More on this -- am also getting "..could not get a free
> page...".
> > =
> > > > >
> > > > > This indicates a memory allocation failed. That means its more
> likely
> > > > someone
> > > > > doesnt check a get_free_page/kmalloc return and continues
blissfully
> > into
> > > > > oblivion mode.
> > > > >
> > > > > Could be the driver fs or scsi layer. Are your other test setups
> > producing
> > > > > a "free page..." message too ? (ie the working ones)
> > > > >
> > > > >
> > > >
> > > >
> > > > -
> > > > To unsubscribe from this list: send the line "unsubscribe
> linux-kernel"
> > in
> > > > the body of a message to majordomo@vger.rutgers.edu
> > > > Please read the FAQ at http://www.tux.org/lkml/
> > > >
> > > >
> > >
> > >
> >
>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/