Re: Crashed Drive, libata wedges when trying to recover data

From: Eric Mudama
Date: Fri Sep 03 2004 - 12:18:47 EST


On 03 Sep 2004 12:47:00 -0400, Greg Stark <gsstark@xxxxxxx> wrote:
> Alan Cox <alan@xxxxxxxxxxxxxxxxxxx> writes:
>
> > On Gwe, 2004-09-03 at 16:58, Greg Stark wrote:
> > > I've even unmounted the filesystem and tried mounting it again. Now I can't
> > > even mount it without generating the error.
> >
> > You may well need to reset or powercycle the drive to get it back from
> > such a state.
>
> Certainly I know power cycling fixes it. That's what I've been doing so far.
>
> > > Sep 3 11:48:39 stark kernel: ata1: command 0x25 timeout, stat 0x59 host_stat 0x21
> > > Sep 3 11:48:39 stark kernel: ata1: status=0x59 { DriveReady SeekComplete DataRequest Error }
> > > Sep 3 11:48:39 stark kernel: ata1: error=0x01 { AddrMarkNotFound }
> >
> > "Its dead Jim". Once you get a drive that dies totally (or just keeps
> > posting up a hardware fail) after the error you are into forensics
> > (and/or backup) land.
>
> There's nothing the driver can do to reset the drive or get back to a known
> good protocol state?
>
> The "ATA: abnormal status 0x59 on port 0xEFE7" makes me think it's just the
> driver getting out of sync with the drive. But i guess that would be hard to
> distinguish from the drive just going south.

Here's my guess at what is happening:

0x59/xx is an artifact of PATA drives being required to transfer bogus
data on an error to satisfy the way the DMA controller was programmed
at the start of the transfer. Most all drives used this same
technique in PIO modes too, sharing common transfer code in their
firmware.

This behavior, unfortunately, caused a ruckus in SATA ... At least
one SATA controller immediately starts sending HOLD primitives when
they see the error bit get set.
In PATA, you can asynchronously issue a soft reset or a new command,
to abort the data transfer for the invalid command. However, in SATA,
once the board starts sending HOLD primitives and the drive responds
with HOLDA, there's no way to transition into X_RDY to send the soft
reset or a new command. Boom, you're deadlocked. This means that in
SATA, the only way to overcome this deadlock in the driver is to have
the host/board generate a COMRESET OOB burst to hard-reset the drive.

Today's (and tomorrow's) generation of SATA drives will never ever
generate a 0x59 status... the error and DRQ bits become mutually
exclusive. However, unfortunately, there are quite a few drives in
the field which have this behavior...

--eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/