Re: writing file to disk: not as easy as it looks

From: Mikulas Patocka
Date: Wed Dec 03 2008 - 10:34:43 EST

On Wed, 3 Dec 2008, Theodore Tso wrote:

> > Ok, "memory failed before disk" is ... bad hardware.
> It's PC class hardware. Live with it. Back when SGI made their own
> hardware, they noticed this problem, and so they wired up their SGI
> machines with powerfail interrupts, and extra big capacitors in their
> power supplies, and when Irix got a powerfail interrupt, it would
> frantically run around aborting DMA transfers to avoid this particular
> problem. At least, that's what an old-timer SGI engineer (who is
> unfortunately no longer at SGI) told me.

I heard this too --- I just don't understand why they routed it to an
interrupt and had the kernel go through the complicated sequence of
aborting the commands --- instead of simply routing it to the PCI reset
line --- that would reset the controller and stop it from feeding data to
the disks.

Also, if they had ECC memory, the chipset should detect unrecoverable
garbage and respond with a target-abort or a full system reset instead of
feeding bad data to the controller.

> PC class hardware doesn't have power fail interrupts. Hence, my advice
> to you is that if you use a filesystem that does logical journalling
> --- better have a UPS.

ATX has a PWR_OK pin that should be deasserted on power failure before
the voltage drops.

I don't know if motherboards use it --- but there should be no problem
routing the pin to the chipset reset and stopping the chipset before the
power goes low.

> > ...but... you seem to be saying that modern filesystems can damage
> > data even on "sane" hardware.
> The example I gave was one where a disk failure could cause a file
> that had previously been successfully written to disk and fsync()'ed to
> be damaged by another filesystem operation ***in the face of hard
> drive failure***. Surely that is obvious. The most obvious case of
> that might be if the disk controller gets confused and slams a data
> block into the wrong location on disk (there's a reason why DIF
> includes the sector number in its checksum and why some enterprise
> databases do the same thing in their tablespace blocks --- it happens
> often enough that paranoid data integrity engineers worry about it).

You can read the block number back from the ATA disk after you write it
and before you submit the command.
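
The quoted point about DIF, that the sector number is included in the
per-block checksum so a block that lands in the wrong place can be
detected on read, can be sketched roughly as follows. This is only an
illustration in Python, not the actual DIF format: the 512-byte payload,
the trailer layout, the use of CRC32, and the names seal_block and
verify_block are all assumptions made for the example.

```python
import struct
import zlib

BLOCK_SIZE = 512                 # assumed payload size, illustration only
TRAILER = struct.Struct("<QI")   # hypothetical trailer: block number + CRC32


def seal_block(block_no: int, payload: bytes) -> bytes:
    """Return the payload followed by a trailer that binds it to block_no."""
    if len(payload) != BLOCK_SIZE:
        raise ValueError("payload must be exactly BLOCK_SIZE bytes")
    # The CRC covers the block number as well as the data, so identical
    # data stored at a different location does not verify.
    crc = zlib.crc32(struct.pack("<Q", block_no) + payload)
    return payload + TRAILER.pack(block_no, crc)


def verify_block(expected_block_no: int, raw: bytes) -> bool:
    """Check that raw was sealed for expected_block_no and is intact."""
    if len(raw) != BLOCK_SIZE + TRAILER.size:
        return False
    payload, trailer = raw[:BLOCK_SIZE], raw[BLOCK_SIZE:]
    stored_no, stored_crc = TRAILER.unpack(trailer)
    if stored_no != expected_block_no:
        return False  # block was written to, or read from, the wrong place
    return stored_crc == zlib.crc32(struct.pack("<Q", stored_no) + payload)
```

With this layout, a block sealed for location 7 verifies when read from
location 7 but fails when it turns up at location 9. The check only
catches the misdirected block when it is read back; it does not catch a
write that was silently dropped and left older, validly sealed data in
place.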
