Re: MD/RAID: what's wrong with sector 1953519935?

From: Ric Wheeler
Date: Wed Aug 26 2009 - 11:38:59 EST


On 08/26/2009 10:46 AM, Andrei Tanas wrote:
On Wed, 26 Aug 2009 06:34:14 -0400, Ric Wheeler <rwheeler@xxxxxxxxxx> wrote:
On 08/25/2009 11:45 PM, Andrei Tanas wrote:
I would suggest that Andrei might try to write and clear the IO error at that
offset. You can use Mark Lord's hdparm to clear a specific sector or just do the
math (carefully!) and dd over it. If the write succeeds (without bumping your
remapped sectors count) this is a likely match to this problem,

I've tried dd multiple times, it always succeeds, and the relocated sector
count is currently 1 on this drive, even though this particular fault
happened at least 3 times so far.
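
The "do the math and dd over it" step above can be sketched as follows. This is only an illustration, not a command taken from the thread: it assumes 512-byte logical sectors and that the sector number reported by the kernel is relative to the whole disk (/dev/sdb here), and the helper name dd_overwrite_command is made up for the example. It only builds the command string and never runs it, since the write destroys whatever is stored in that sector.

```python
SECTOR_SIZE = 512  # assumption: 512-byte logical sectors on this drive


def dd_overwrite_command(device, sector, count=1, sector_size=SECTOR_SIZE):
    """Return (byte_offset, dd_command) for zeroing `count` sectors.

    Hypothetical helper for illustration. The command is only built,
    never executed: running it overwrites data on `device`.
    """
    if sector < 0 or count < 1:
        raise ValueError("sector must be >= 0 and count must be >= 1")
    byte_offset = sector * sector_size
    # With bs equal to the sector size, dd's seek= is expressed in sectors,
    # so no further arithmetic is needed on the dd side.
    command = (
        f"dd if=/dev/zero of={device} bs={sector_size} "
        f"seek={sector} count={count} oflag=direct"
    )
    return byte_offset, command


# Sector number taken from the kernel log in this thread.
offset, command = dd_overwrite_command("/dev/sdb", 1953519935)
print(offset)
print(command)
```

The byte offset comes out at roughly 1 TB, i.e. very near the end of the disk. oflag=direct is used so the write bypasses the page cache and actually reaches the drive; checking the SMART reallocated sector count before and after is what tells you whether the drive remapped anything.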


I would bump that count way up (say to 2) and see if you have an
issue...

Not sure what you mean by this: how can I artificially bump the relocated
sector count?


Sorry - you need to set the tunable:

/sys/block/mdX/md/safe_mode_delay

to something like "2" to prevent that sector from being a hotspot...
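
A minimal sketch of changing that tunable from a script, for anyone who prefers it to echoing into the file by hand. The function name and the "md0" array name are placeholders, not something from this thread; the sysfs_root parameter exists only so the function can be exercised against a scratch directory. Writing the real file requires root.

```python
from pathlib import Path


def set_safe_mode_delay(md_name, seconds, sysfs_root="/sys"):
    """Write `seconds` to the array's safe_mode_delay and return (old, new).

    Hypothetical helper: `md_name` is the array name (for example "md0"),
    and `sysfs_root` is a parameter only so the code can be tested
    without touching a live system.
    """
    path = Path(sysfs_root) / "block" / md_name / "md" / "safe_mode_delay"
    old_value = path.read_text().strip()
    path.write_text(f"{seconds}\n")
    # Read the value back, since the kernel reports what it actually accepted.
    new_value = path.read_text().strip()
    return old_value, new_value


# Example, as root, with "md0" standing in for the real array name:
#   old, new = set_safe_mode_delay("md0", 2)
```

The setting does not persist across reboots or array reassembly, so it would need to be reapplied at boot if it turns out to help.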

I did that as soon as you suggested that it's possible to tune it. The
array is still being rebuilt (it's a fairly busy machine, so rebuilding is
slow). I'll monitor it, but I don't expect to see the results soon as even
with the default value of 0.2 it used to happen once in several weeks.

On another note: is it possible that the drive was actually working
properly but was not given enough time to complete the write request? These
newer drives have 32MB cache but the same rotational speed and seek times
as the older ones so they must need more time to flush their cache?

Andrei.


Timeouts on IO requests are pretty large, and drives usually won't fail an IO unless there is a real problem, but I will add the linux-ide list to this response so they can weigh in.

I suspect that the error was real, but might be this "repairable" type of adjacent track issue I mentioned before. Interesting to note that just following the error, you see that it was indeed the super block that did not get updated...

The error you referenced was:

[90307.328266] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[90307.328275] ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[90307.328277] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[90307.328280] ata2.00: status: { DRDY }
[90307.328288] ata2: hard resetting link
[90313.218511] ata2: link is slow to respond, please be patient (ready=0)
[90317.377711] ata2: SRST failed (errno=-16)
[90317.377720] ata2: hard resetting link
[90318.251720] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[90318.338026] ata2.00: configured for UDMA/133
[90318.338062] ata2: EH complete
[90318.370625] end_request: I/O error, dev sdb, sector 1953519935
[90318.370632] md: super_written gets error=-5, uptodate=0
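
Since this fault recurs only once in several weeks, it may help to pull the failing sector out of saved kernel logs and check whether it is always the same one. The following is a rough sketch, assuming the "end_request: I/O error" message format shown above; the regular expression and function name are made up for the example and other kernel versions may word the message differently.

```python
import re

# Matches lines such as:
#   [90318.370625] end_request: I/O error, dev sdb, sector 1953519935
IO_ERROR_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+end_request: I/O error, "
    r"dev (?P<dev>\w+), sector (?P<sector>\d+)"
)


def find_io_errors(log_text):
    """Return a list of (timestamp, device, sector) for each I/O error line."""
    hits = []
    for line in log_text.splitlines():
        match = IO_ERROR_RE.search(line)
        if match:
            hits.append(
                (
                    float(match.group("ts")),
                    match.group("dev"),
                    int(match.group("sector")),
                )
            )
    return hits


SAMPLE_LOG = """\
[90318.338062] ata2: EH complete
[90318.370625] end_request: I/O error, dev sdb, sector 1953519935
[90318.370632] md: super_written gets error=-5, uptodate=0
"""

for timestamp, device, sector in find_io_errors(SAMPLE_LOG):
    print(timestamp, device, sector)
```

If the same sector shows up on every occurrence, that points at the location (the md superblock, which is rewritten often) rather than at random media failures.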


Ric
