Re: ext2/3: document conditions when reliable operation is possible

From: Pavel Machek
Date: Mon Mar 23 2009 - 06:42:48 EST


On Mon 2009-03-16 14:26:23, Rob Landley wrote:
> On Monday 16 March 2009 07:28:47 Pavel Machek wrote:
> > Hi!
> > > > + Fortunately writes failing are very uncommon on traditional
> > > > + spinning disks, as they have spare sectors they use when write
> > > > + fails.
> > >
> > > I vaguely recall that the behavior of when a write error _does_ occur is
> > > to remount the filesystem read only? (Is this VFS or per-fs?)
> >
> > Per-fs.
>
> Might be nice to note that in the doc.

Ok, can you suggest a patch? I believe remount-ro is already
documented ... somewhere :-).

> > > I'm aware write errors shouldn't happen, and by the time they do it's too
> > > late to gracefully handle them, and all we can do is fail. So how do we
> > > fail?
> >
> > Well, even remount-ro may be too late, IIRC.
>
> Care to elaborate? (When a filesystem is mounted RO, I'm not sure what
> happens to the pages that have already been dirtied...)

Well, fsync() error reporting does not really work properly, but I
guess it will save you for the remount-ro case. So the data will be in
the journal, but it will be impossible to replay it...
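
For illustration only (not from the original thread): a minimal sketch of what an application can do on its side, which is to check every write() and fsync() and treat any error as "data may not be on disk". The helper name write_durably is hypothetical, and the assumption that a filesystem remounted read-only reports EROFS or EIO here is mine, not something the thread confirms.

```python
import errno
import os

def write_durably(path, data):
    """Write data and fsync it. Return None on success, else the errno."""
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    except OSError as e:
        return e.errno              # e.g. errno.EROFS on a read-only mount
    try:
        view = memoryview(data)
        while view:                 # os.write() may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                # the point where a late write error should surface
        return None
    except OSError as e:
        return e.errno              # assumption: EIO or EROFS after a remount-ro
    finally:
        os.close(fd)
```

A caller would treat any non-None result as a reason to stop trusting the file, since, as said above, the error reporting itself is not fully reliable.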

> > > (Writes aren't always cleanly at the start of an erase block, so critical
> > > data _before_ what you touch is endangered too.)
> >
> > Well, flashes do remap, so it is actually "random blocks".
>
> Fun.

Yes.

> > > > + otherwise, disks may write garbage during powerfail.
> > > > + Not sure how common that problem is on generic PC machines.
> > > > +
> > > > + Note that atomic write is very hard to guarantee for RAID-4/5/6,
> > > > + because it needs to write both changed data, and parity, to
> > > > + different disks.
> > >
> > > These days instead of "atomic" it's better to think in terms of
> > > "barriers".
> >
> > This is not about barriers (that should be different topic). Atomic
> > write means that either whole sector is written, or nothing at all is
> > written. Because raid5 needs to update both master data and parity at
> > the same time, I don't think it can guarantee this during powerfail.
>
> Good point, but I thought that's what journaling was for?

I believe journaling operates on the assumption that "either the whole
sector is written, or nothing at all is written".
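
A toy illustration of the raid5 concern (a sketch, not real md code; the two-data-disk stripe, the 4-byte "sectors" and the function names are all made up for the example). If power fails after the data write but before the parity write, a later rebuild from parity returns garbage for a block that was never touched:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together (RAID-5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def simulate_interrupted_update():
    # Toy stripe: two data "disks" and one parity "disk".
    d0, d1 = b"AAAA", b"BBBB"
    parity = xor_blocks([d0, d1])

    # New data reaches disk 0, then power fails before parity is rewritten.
    d0 = b"CCCC"
    # parity = xor_blocks([d0, d1])   # this write never happens

    # Later disk 1 dies and is rebuilt from d0 and the stale parity.
    rebuilt_d1 = xor_blocks([d0, parity])
    return d1, rebuilt_d1

original, rebuilt = simulate_interrupted_update()
print(original, rebuilt)   # b'BBBB' b'@@@@'
```

Note that the damaged block is d1, which the interrupted write never touched, so a journal that only protects the blocks it wrote does not obviously help.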

> I'm aware that any flash filesystem _must_ be journaled in order to work
> sanely, and must be able to view the underlying erase granularity down to the
> bare metal, through any remapping the hardware's doing. Possibly what's
> really needed is a "flash is weird" section, since flash filesystems can't be
> mounted on arbitrary block devices.

> Although an "-O erase_size=128" option so they _could_ would be nice. There's
> "mtdram" which seems to be the only remaining use for ram disks, but why there
> isn't an "mtdwrap" that works with arbitrary underlying block devices, I have
> no idea. (Layering it on top of a loopback device would be most
> useful.)

I don't think that works. CompactFlash (etc.) cards basically remap
the data at random, so you can't really run a flash filesystem over a
CompactFlash/USB/SD card -- you don't know the details of the remapping.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html