Re: Linux kernel - Libata bad block error handling to user mode program

From: Robert Hancock
Date: Thu Mar 04 2010 - 21:16:15 EST


On Thu, Mar 4, 2010 at 8:11 PM, s ponnusa <foosaa@xxxxxxxxx> wrote:
>> There's nothing in libata which will cause the operation to eventually
>> return success if the drive keeps failing it (at least there definitely
>> should not be and I very much doubt there is). My guess is that somehow what
>> you think should be happening is not what the drive is actually doing (maybe
>> one of the retries you're seeing is actually succeeding in writing to the
>> disk, or at least the drive reports it was).
>>
>> You haven't posted any of the actual kernel output you're seeing, so it's
>> difficult to say exactly what's going on. However, attempting to scan for
>> disk errors using writes seems like a flawed strategy. As several people
>> have mentioned, drives can't necessarily detect errors on a write.
>>
>
> The scenario involves lots of bad drives with known bad sector
> locations. Take MHDD for example: it sends an ATA write command to one
> of the bad sectors, the drive returns failure / timeout, it tries
> again, the drive still says failure / timeout, and the program exits
> and reports failure. If we are not checking for errors during the
> write process, and continue to reallocate the sector or retry the
> write, what happens after all the available sectors are remapped? I
> still cannot picture it, for some reason.
>
> Consider this scenario:
> My write program says write passed. But when I used another
> verification program (replica of the erasure program but does only
> read / verify) it is unable to read the data and returns failure. No
> other program (for example a Windows based hex editor or DOS based
> disk editor) is able to read the information from that particular
> sector. So, obviously the data written by linux is corrupted and
> cannot be read back by any other means. And the program which wrote
> the data is unaware of the error that has happened at the lower level.
> But the error log clearly shows the issue was caught, and it is being
> handled differently.
>
> I've attached part of a sample dmesg log that was recorded while the
> program was grinding on a bad sector; the write eventually passed.

[ 7671.006928] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[ 7671.006936] ata1.00: BMDMA stat 0x25
[ 7671.006943] ata1.00: cmd c8/00:08:a8:56:75/00:00:00:00:00/e5 tag 0 dma 4096 in
[ 7671.006945] res 51/40:04:ac:56:75/10:02:05:00:00/e5 Emask 0x9 (media error)
[ 7671.006949] ata1.00: status: { DRDY ERR }
[ 7671.006951] ata1.00: error: { UNC }
[ 7671.028606] ata1.00: configured for UDMA/100
[ 7671.028617] ata1: EH complete

Command C8 is READ DMA, so this is a read that's failing. Almost all of
the failures in that log appear to be failed reads; I don't see any
failed writes. It looks like the drive is accepting the writes and
reporting success, but is then unable to read the data back (the reads
may come from read-modify-write operations or from some other source).
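
For anyone who wants to check this kind of log line themselves, here is a
minimal sketch of decoding the cmd and res lines in the log above. It is
not part of libata. The names parse_taskfile, decode_flags, and
ATA_COMMANDS are made up for illustration. The opcode table lists only a
few common read/write commands, the flag tables list only a few common
bits, and the LBA is computed on the assumption of a 28-bit LBA command
(such as C8), so the high-order "hob" bytes are ignored.

```python
import re

# A few common ATA read/write opcodes (illustrative subset, not exhaustive).
ATA_COMMANDS = {
    0x20: ("READ SECTOR(S)", "read"),
    0x24: ("READ SECTOR(S) EXT", "read"),
    0x25: ("READ DMA EXT", "read"),
    0xC8: ("READ DMA", "read"),
    0x60: ("READ FPDMA QUEUED", "read"),
    0x30: ("WRITE SECTOR(S)", "write"),
    0x34: ("WRITE SECTOR(S) EXT", "write"),
    0x35: ("WRITE DMA EXT", "write"),
    0xCA: ("WRITE DMA", "write"),
    0x61: ("WRITE FPDMA QUEUED", "write"),
}

# Selected bits of the ATA status and error registers.
STATUS_BITS = [(0x80, "BSY"), (0x40, "DRDY"), (0x20, "DF"),
               (0x08, "DRQ"), (0x01, "ERR")]
ERROR_BITS = [(0x80, "ICRC"), (0x40, "UNC"), (0x10, "IDNF"), (0x04, "ABRT")]

# Layout: first/feature:nsect:lbal:lbam:lbah/<five hob bytes>/device
# "first" is the command opcode on a cmd line and the status on a res line;
# the next byte is the feature (cmd line) or error register (res line).
TASKFILE_RE = re.compile(
    r"([0-9a-f]{2})/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/([0-9a-f]{2})"
)


def parse_taskfile(text):
    """Parse the hex taskfile dump from a libata 'cmd' or 'res' line."""
    match = TASKFILE_RE.search(text)
    if match is None:
        raise ValueError("no taskfile found in: %r" % text)
    first = int(match.group(1), 16)
    second, nsect, lbal, lbam, lbah = (
        int(byte, 16) for byte in match.group(2).split(":")
    )
    device = int(match.group(4), 16)
    # Assumption: LBA28 addressing, where bits 24-27 live in the device byte.
    lba28 = ((device & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
    return {"first": first, "second": second, "nsect": nsect, "lba28": lba28}


def decode_flags(value, table):
    """Return the names of the bits in 'table' that are set in 'value'."""
    return [name for bit, name in table if value & bit]


cmd = parse_taskfile("ata1.00: cmd c8/00:08:a8:56:75/00:00:00:00:00/e5 tag 0")
res = parse_taskfile("res 51/40:04:ac:56:75/10:02:05:00:00/e5 Emask 0x9")

name, direction = ATA_COMMANDS.get(cmd["first"], ("unknown", "unknown"))
print("command:", name, "(%s)" % direction)
print("start LBA: 0x%x, sectors: %d" % (cmd["lba28"], cmd["nsect"]))
print("status:", decode_flags(res["first"], STATUS_BITS))
print("error:", decode_flags(res["second"], ERROR_BITS))
print("failing LBA: 0x%x" % res["lba28"])
```

Under those assumptions, the lines quoted above decode to a READ DMA of 8
sectors (4096 bytes, matching "dma 4096 in") starting at LBA 0x57556a8,
with the drive reporting an uncorrectable error at LBA 0x57556ac, i.e. the
fifth sector of the request.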