Re: MD/RAID: what's wrong with sector 1953519935?

From: Ric Wheeler
Date: Tue Aug 25 2009 - 21:31:48 EST


On 08/25/2009 09:24 PM, NeilBrown wrote:
On Wed, August 26, 2009 11:06 am, Ric Wheeler wrote:
On 08/25/2009 08:50 PM, NeilBrown wrote:

All 1TB drives are exactly the same size.
If you create a single partition (e.g. sdb1) on such a device, and that
partition starts at sector 63 (which is common), and create an md
array using that partition, then the superblock will always be at the
address you quote.
The superblock is probably updated more often than any other block in
the array, so there is probably an increased likelihood of an error
being reported against that sector.
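
The arithmetic behind that claim can be sketched as follows. This is only an illustration under stated assumptions, none of which appear in the message itself: the drive exposes 1,953,525,168 sectors (a common LBA count for "1 TB" drives), the partition ends on a legacy 255-head by 63-sector cylinder boundary, and the array uses version 0.90 metadata, whose superblock sits in the last 64 KiB-aligned 64 KiB block of the member device.

```python
# Worked arithmetic: where a v0.90 md superblock lands on a "1 TB" drive.
# Everything marked "assumption" is NOT stated in the original message.

TOTAL_SECTORS = 1_953_525_168   # assumption: typical LBA count of a 1 TB drive
PART_START = 63                 # from the text: partition starts at sector 63
CYLINDER = 255 * 63             # assumption: legacy CHS cylinder = 16065 sectors
MD_RESERVED_SECTORS = 128       # 64 KiB reserved area, in 512-byte sectors


def sb_offset_in_member(member_sectors: int) -> int:
    """Superblock offset (in sectors) inside an md member device, v0.90 layout:
    round the size down to a 64 KiB boundary, then step back one 64 KiB block."""
    return (member_sectors & ~(MD_RESERVED_SECTORS - 1)) - MD_RESERVED_SECTORS


# The partition is assumed to end on the last whole cylinder boundary.
part_end = (TOTAL_SECTORS // CYLINDER) * CYLINDER   # exclusive end, whole-disk LBA
part_size = part_end - PART_START                   # sectors in sdb1

# Convert the partition-relative offset back to a whole-disk sector number.
sb_lba = PART_START + sb_offset_in_member(part_size)

print(part_size)  # prints 1953520002
print(sb_lba)     # prints 1953519935
```

Under those assumptions the result matches the sector in the subject line, which is consistent with the explanation that the reported address is the superblock.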

So it is not just a coincidence.
Whether there is some deeper underlying problem though, I cannot say.
Google only claims 68 matches for that number which doesn't seem
big enough to be significant.

NeilBrown



Neil,

One thing that can happen when we have a hot spot (like the
superblock) on high-capacity drives is that the frequent writes degrade
the data in adjacent tracks. Some drives have firmware that watches for
this and rewrites adjacent tracks, but it is also a good idea to avoid
overly frequent updates.

Yet another detail to worry about.... :-(

it never ends :-)



Didn't you have a tunable to decrease this update frequency?

/sys/block/mdX/md/safe_mode_delay
is a time in seconds (Default 0.200) between when the last write to
the array completes and when the superblock is marked as clean.
Depending on the actual rate of writes to the array, the superblock
can be updated as much as twice in this time (once to mark dirty,
once to mark clean).

Increasing the number can decrease the update frequency of the superblock,
but the exact effect on update frequency is very load-dependent.
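
As a rough sketch of how the tunable might be read and raised from a script: the helper names below and the `sysfs_root` parameter are hypothetical, introduced only for illustration (the parameter lets the code be exercised against a scratch directory). Only the path /sys/block/mdX/md/safe_mode_delay and the seconds-valued format come from the text. Writing to the real file requires root.

```python
from pathlib import Path


def safe_mode_delay_path(md_name: str, sysfs_root: str = "/sys/block") -> Path:
    """Build the path to the tunable for an array such as 'md0'."""
    return Path(sysfs_root) / md_name / "md" / "safe_mode_delay"


def get_safe_mode_delay(md_name: str, sysfs_root: str = "/sys/block") -> float:
    """Return the current delay in seconds (the text gives 0.200 as default)."""
    return float(safe_mode_delay_path(md_name, sysfs_root).read_text().strip())


def set_safe_mode_delay(md_name: str, seconds: float,
                        sysfs_root: str = "/sys/block") -> None:
    """Write a new delay in seconds, formatted with millisecond precision."""
    if seconds < 0:
        raise ValueError("delay must be non-negative")
    safe_mode_delay_path(md_name, sysfs_root).write_text(f"{seconds:.3f}\n")


if __name__ == "__main__":
    # Hypothetical usage: raise the delay on md0 from the default to 2 seconds.
    print("before:", get_safe_mode_delay("md0"))
    set_safe_mode_delay("md0", 2.0)
    print("after: ", get_safe_mode_delay("md0"))
```

A longer delay trades fewer superblock writes for a longer window in which the array is still marked dirty after the last write, so an unclean shutdown in that window is more likely to trigger a resync.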

Obviously a write-intent-bitmap, which is rarely more than a few
sectors, can also see lots of updates, and it is harder to tune
that (you have to set things up when you create the bitmap).

NeilBrown


We did see issues in practice with adjacent sectors with some drives, so this one is worth tuning down.

I would suggest that Andrei might try to write and clear the IO error at that offset. You can use Mark Lord's hdparm to clear a specific sector or just do the math (carefully!) and dd over it. If the write succeeds (without bumping your remapped sectors count), this is a likely match to this problem.
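
The "do the math and write over it" step might look roughly like the sketch below. It is an illustration only, not the procedure hdparm or dd actually follows: `rewrite_sector` and `partition_relative` are hypothetical helpers, 512-byte sectors are assumed, and the write goes through the page cache, so on a real block device the kernel may try to read neighbouring sectors first (a direct-I/O write avoids that). Overwriting the superblock sector with zeros destroys the md metadata stored there, so this belongs on a member that can be re-added or on a scratch file.

```python
import os
from typing import Optional

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def partition_relative(disk_lba: int, part_start: int = 63) -> int:
    """Convert a whole-disk sector number to an offset within the partition."""
    if disk_lba < part_start:
        raise ValueError("sector lies before the partition")
    return disk_lba - part_start


def rewrite_sector(device: str, lba: int, payload: Optional[bytes] = None) -> int:
    """Overwrite exactly one sector of `device` at sector number `lba`.

    `payload` defaults to zeros. Returns the number of bytes written.
    DESTRUCTIVE: whatever was stored in that sector is lost.
    """
    data = bytes(SECTOR_SIZE) if payload is None else payload
    if len(data) != SECTOR_SIZE:
        raise ValueError("payload must be exactly one sector long")
    fd = os.open(device, os.O_WRONLY)
    try:
        written = os.pwrite(fd, data, lba * SECTOR_SIZE)
        os.fsync(fd)  # push the write out so any I/O error surfaces here
    finally:
        os.close(fd)
    return written


# The sector from the subject line, expressed relative to a partition
# that starts at sector 63:
print(partition_relative(1953519935))  # prints 1953519872
```

Whether the address is given relative to the whole disk or to the partition depends on which device node is opened, which is exactly where the "carefully!" applies.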

ric



