RE: PATCH: Further aacraid work

From: Salyzyn, Mark
Date: Tue Jun 29 2004 - 14:29:16 EST


I am gratified that 1.1.5-2345 works, and works well (*many*
performance improvements have occurred in this driver, but some of those
have reached Alan's patch). Still, I am at a loss: I was sure that the
lockup was associated with commands taking longer than 60 seconds to
complete. That can happen on a complicated RAID card with multiple
targets, and it unfortunately gives the SCSI layer, and especially ext3,
some headaches. The workaround in the reset handler is to let the
outstanding commands quiesce, giving the Firmware some breathing space
to respond to the test unit ready command that is issued to the target
immediately after the reset handler returns.
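
To make the quiesce idea concrete, here is a minimal user-space sketch of
"wait, with a bound, until no commands are outstanding". It is not the
driver's code: struct fake_adapter, fake_poll and wait_for_quiesce are
hypothetical names, and the one-poll-per-interval model of the Firmware
is an assumption made purely for illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for adapter state; not the real aacraid structure. */
struct fake_adapter {
    int outstanding;        /* commands issued but not yet completed */
    int complete_per_poll;  /* commands the simulated Firmware finishes per poll */
};

/* Simulate one polling interval: the Firmware completes a few commands. */
static void fake_poll(struct fake_adapter *a)
{
    a->outstanding -= a->complete_per_poll;
    if (a->outstanding < 0)
        a->outstanding = 0;
}

/*
 * Wait for at most max_polls intervals for the outstanding count to reach
 * zero. Returns the number of polls that were needed, or -1 if commands
 * are still outstanding when the budget runs out.
 */
static int wait_for_quiesce(struct fake_adapter *a, int max_polls)
{
    int polls;

    assert(a != 0 && max_polls >= 0);

    for (polls = 0; polls < max_polls; polls++) {
        if (a->outstanding == 0)
            return polls;
        fake_poll(a);
    }
    return a->outstanding == 0 ? polls : -1;
}
```

A reset handler built along these lines would report success only when
wait_for_quiesce() returns a non-negative value; a return of -1 is the
case where the adapter never drained, which is the situation under
suspicion here.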

That this did not solve the problem implies, as Alan had noted, that
the adapter itself could be locking up or going unresponsive under
load. Both reports, this one and "PERC 3/Di broken
2.6.6-mm5 -> 2.6.7-rc1-mm", now smell of some subtle problems creeping
in.

Can we get some additional tests on this?

1) Does this problem occur on a single CPU (Hyperthreading disabled
too)? I am assuming some locking issues appeared, mainly because of some
restructuring in that code.
2) If you turn off Cache (Read & Write), does the problem persist? With
older Firmware the usual cause is that a command's duration gets
extended because a Cache Flush runs at a priority that blocks other
commands.
3) The `Adaptec' driver has a beefed up health checker
(aac_rx_check_health) in rx.c that may be useful for catching what is
called a `blinkLED' report from the adapter. If this piece of code is
moved into the 2.6.7 driver, with a single printk in linit.c reporting
the return value from the health check, we may get a clue to the
conditions of failure if the fault is in the adapter.
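
As a rough illustration of item 3, the sketch below decodes an adapter
status word into a health value and formats the single diagnostic line.
It is user-space code, not the contents of rx.c: fake_check_health,
fake_report_health, the two FAKE_* flags and the bit layout are all
assumptions invented for the example, and snprintf stands in for the
printk that the driver would use.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed status-word layout, for illustration only. */
#define FAKE_PANIC_FLAG 0x00000002u  /* Firmware has stopped with a blinkLED code */
#define FAKE_UP_FLAG    0x00000080u  /* Firmware is up and running */

/*
 * Returns the 8-bit blinkLED code taken from bits 16..23 if the panic flag
 * is set, 0 if the adapter reports that it is running, and -1 if it reports
 * neither. A panic with code 0 would be indistinguishable from a healthy
 * adapter in this sketch.
 */
static int fake_check_health(uint32_t status)
{
    if (status & FAKE_PANIC_FLAG)
        return (int)((status >> 16) & 0xFF);
    if (status & FAKE_UP_FLAG)
        return 0;
    return -1;
}

/* Format the one-line report; returns the length that snprintf reports. */
static int fake_report_health(char *buf, size_t len, int health)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "aacraid: health check returned %d", health);
}
```

Logging the raw return value, rather than only a pass/fail result, is
what would let a report distinguish "adapter panicked with code N" from
"adapter simply stopped answering".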

Sincerely -- Mark Salyzyn

-----Original Message-----
From: Byron Stanoszek [mailto:gandalf@xxxxxxxxx]
Sent: Tuesday, June 29, 2004 3:03 PM
To: Salyzyn, Mark
Cc: Mark Haverkamp; Alan Cox; linux-kernel; linux-scsi
Subject: RE: PATCH: Further aacraid work

On Tue, 29 Jun 2004, Salyzyn, Mark wrote:

> I believe this nails the problem too.
>
> However, there is a corner case condition lurking on this (See my
> currently unanswered email "error recovery and command completion" on
> linux-scsi) where I try to deal with completing a command while error
> recovery is triggered. Scsi_done will return doing *nothing*,
> effectively losing the command completion.
>
> MarkH, I had talked to you about the addition of the scsi_add_timer
> before calling scsi_done to address this condition. I do not believe
> this to be the (Reliable and/or performance oriented) solution.
>
> Sincerely -- Mark Salyzyn

I've tested out both patches sent to me.

Test 1: aacraid-1.1.5-2245.tgz

Works flawlessly and speedily! The rsync completes, and doing a sync() (as
called during a normal lilo update) takes roughly 1 second as opposed to 20
with the original aacraid patch from Alan Cox. Also, no SCSI hang message ever
appears.

Test 2: Mark Haverkamp's linit.c patch

The "SCSI hang" console message appears just as before during the 'rsync',
however (unlike before) the device is still usable for roughly 30 seconds after
the problem. During these 30 seconds, the 'rsync' process is hung, but I can
still do a 'df', 'ls', and so on. After 30 seconds, the entire /dev/sda locks
up and I have no choice but to reboot the system.

-Byron
