Re: Oops while booting

Gerard Roudier (groudier@club-internet.fr)
Sat, 9 Jan 1999 09:56:07 +0100 (MET)


On Thu, 7 Jan 1999, Doug Ledford wrote:

> Gerard Roudier wrote:
>
> > > Explanation: scsi_old_done() sees that some devices do not respond to
> > > request sense, and calls scsi_reset(SCpnt, SCSI_RESET_SYNCHRONOUS). This
> > > function assumes that the Scsi_cmnd->scsi_done() is NOT called, but the
> > > ncr53c8xx calls this function, and some time later the system crashes
> > > because the same command is processed twice. (see the warning in
> > > scsi_obsolete.c, function scsi_old_done())
> >
> > You explain what happens, but I think it is scsi_obsolete.c that broke
> > the previous behaviour of scsi.c. The ncr53c8xx driver is fine with
> > respect to what the SCSI_RESET_SYNCHRONOUS flag is supposed to expect from
> > low level drivers, in my opinion.
> >
> > The SCSI_RESET_SYNCHRONOUS flag tells low level drivers to call the done()
> > routine even if the command is not actually queued to it.
> > So, scsi_obsolete.c is wrong when trying to REDO this command but should
> > just return, IMO, as this was coded in 2.0 kernel/scsi.c
>
> Check the 2.0 scsi.c code. It doesn't just return in all SCSI_RESET_SYNC

Will do.

> cases. That's part of the problem. There are three different places that
> called SCSI_RESET_SYNC and they did two totally different things and gave the
> low level driver no way to know the difference. Therefore, in the 2.0 scsi.c
> code, there were checks to avoid re-use of the same command and double
> queueing of commands. That's evidently what has been broken, the code itself
> for SCSI_RESET_SYNC has always been broken and there was no right way to do
> things, any way you did it was guaranteed to be wrong sometime. That's one of
> the reasons why the mid-level code has needed rewriting for so long.

SCSI_RESET_SYNCHRONOUS (at least in 2.0) is used when the reset()
operation of the low-level driver is entered from the scsi_done()
callback. In my opinion, it was brain-damaged in the first place to
expect every driver to behave correctly in such a situation, even when a
flag describing the situation is passed to it. In the early days
of the ncr53c8xx Linux driver, when the reset flags didn't exist, I ended
up assuming that the scsi code was capable of re-entering the driver
operations in any situation, and I tried to deal with that. The weird
recursions of the form scsi_done()/reset()/scsi_done()/queuecommand()/etc.
are handled in this driver using a singly linked waiting list that is
simply picked up before being flushed or requeued, so that any recursion
works on a new instance of the list. BTW, I have not had enough time to
really study the 2.1/2.2 scsi code, and thus I do not plan to adapt the
ncr driver to it for the Linux 2.2 release.
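
To illustrate the waiting-list idea, here is a rough sketch in plain C. It
is not the actual ncr53c8xx code: struct cmd, struct host,
wait_list_push(), flush_waiting() and demo_done() are hypothetical names
made up for this example, and locking and interrupt masking are left out
entirely. The point is only that the list head is detached before any
callback runs, so a re-entrant call finds a fresh, empty list instead of
the one being walked.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical command type for illustration; not the kernel's Scsi_cmnd. */
struct cmd {
    int id;
    int done_count;     /* number of times completion ran for this command */
    struct cmd *next;   /* link in the singly linked waiting list */
};

/* Hypothetical per-host state: head of the waiting list. */
struct host {
    struct cmd *waiting;
};

/* Completion callback; it is allowed to re-enter the driver. */
typedef void (*done_fn)(struct host *h, struct cmd *c);

/* Put a command on the waiting list (pushed at the head, so LIFO order). */
static void wait_list_push(struct host *h, struct cmd *c)
{
    assert(h != NULL && c != NULL);
    c->next = h->waiting;
    h->waiting = c;
}

/*
 * Flush the waiting list. The whole list is picked up first and the head
 * is reset, so anything queued or flushed from inside done() operates on
 * a new instance of the list. Returns the number of commands completed
 * by this invocation.
 */
static int flush_waiting(struct host *h, done_fn done)
{
    struct cmd *list = h->waiting;  /* pick up the current list */
    int n = 0;

    h->waiting = NULL;              /* re-entrant calls see an empty list */

    while (list != NULL) {
        struct cmd *c = list;
        list = c->next;             /* advance before calling out */
        c->next = NULL;
        done(h, c);                 /* may push or flush again */
        n++;
    }
    return n;
}

/* Example callback that re-enters the flush path from completion. */
static void demo_done(struct host *h, struct cmd *c)
{
    c->done_count++;
    flush_waiting(h, demo_done);    /* finds an empty list and returns */
}
```

With this shape, each queued command is completed exactly once even though
the callback recurses into flush_waiting(), because the recursive call
never sees the commands already picked up by the outer call.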

Manfred did excellent work tracking this problem down. Since I cannot
reproduce it, I can only think about a fix on paper, and so far I have
not found anything satisfying. It seems that numerous low-level scsi
drivers still use the old scsi code and so may be affected by the problem.
If the scsi_obsolete stuff is the culprit, then we should probably try
fixing it instead of bloating the low-level drivers further.

Regards,
Gerard.
