Re: FYI: RAID5 unusably unstable through 2.6.14
From: David R
Date: Tue Jan 17 2006 - 14:57:38 EST
Cynbe ru Taren wrote:
> The current Linux kernel RAID5 implementation is just
> too fragile to be used for most of the applications
> where it would be most useful.
I'm not sure I agree.
> What happens repeatedly, at least in my experience over
> a variety of boxes running a variety of 2.4 and 2.6
> Linux kernel releases, is that any transient I/O problem
> results in a critical mass of RAID5 drives being marked
> 'failed', at which point there is no longer any supported
What "transient" I/O problem would this be. I've had loads of issues with
flaky motherboard/PCI bus implementations that make RAID using addin cards
(all 5 slots filled with other devices) a nightmare. The built in controllers
seem to be more reliable.
> way of retrieving the data on the RAID5 device, even
> though the underlying drives are all fine, and the underlying
> data on those drives almost certainly intact.
This is no problem, just use something like
mdadm --assemble --force /dev/md5 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
/dev/sde1
(Then of course do a fsck)
You can even do this with (nr.drives-1), then add in the last one to be
sync'ed up in the background.
> This has just happened to me for at least the sixth time,
> this time in a brand new RAID5 consisting of 8 200G hotswap
> SATA drives backing up the contents of about a dozen onsite
> and offsite boxes via dirvish, which took me the better part
> of December to get initialized and running, and now two weeks
> later I'm back to square one.
:-( .. maybe try the force assemble?
> I'm currently digging through the md kernel source code
> trying to work out some ad-hoc recovery method, but this
> level of flakiness just isn't acceptable on systems where
> reliable mass storage is a must -- and when else would
> one bother with RAID5?
It isn't flaky for me now I'm using a better quality motherboard, in fact it's
saved me through 3 near simultaneous failures of WD 250GB drives.
> We need RAID5 to be equally resilient in the face of
> real-world problems, people -- it isn't enough to
> just be able to function under ideal lab conditions!
I think it is. The automatics are paranoid (as they should be) when failures
are noticed. The array can be assembled manually though.
> A design bug is -still- a bug, and -still- needs to
> get fixed.
It's not a design bug - in my opinion.
> Something HAS to be done to make the RAID5 logic
> MUCH more conservative about destroying RAID5
> systems in response to a transient burst of I/O
> errors, before it can in good conscience be declared
If such things are common you should investigate the hardware.
> ready for production use -- or at MINIMUM to provide
> a SUPPORTED way of restoring a butchered RAID5 to
> last-known-good configuration or such once transient
> hardware issues have been resolved.
It is. See above.
> In the meantime, IMHO Linux RAID5 should be prominently flagged
> EXPERIMENTAL -- NONCRITICAL USE ONLY or some such, to avoid
> building up ill-will and undeserved distrust of Linux
> software quality generally.
I'd calm down if I were you.
Cheers
David
Attachment:
signature.asc
Description: OpenPGP digital signature