Re: libata in 2.4.24?

From: Jeff Garzik
Date: Tue Dec 02 2003 - 15:37:44 EST


On Tue, Dec 02, 2003 at 03:10:19PM -0500, Greg Stark wrote:
> Jeff Garzik <jgarzik@xxxxxxxxx> writes:
>
> > So, today, no acknowledgement occurs until the data _really_ is in the
> > drive's buffers.
>
> The drive's buffers isn't good enough. If power is lost the write will be lost
> and the database corrupt. It needs to be on the platters.

Certainly agreed.


> > > This doesn't happen with SCSI disks where multiple requests can be pending so
> > > there's no urgency to reporting a false success. The request doesn't complete
> > > until the write hits disk. As a result SCSI disks are reliable for database
> > > operation and IDE disks aren't unless write caching is disabled.
> >
> > This is not really true.
> >
> > Regardless of TCQ, if the OS driver has not issued a FLUSH CACHE (IDE)
> > or SYNCHRONIZE CACHE (SCSI), then the data is not guaranteed to be on
> > the disk media. Plain and simple.
>
> That doesn't agree with people's experience. People seem to find that SCSI
> drives never cache writes. This sort of makes sense since there's just not
> much reason to report a write success before the write can be performed.
> There's no performance advantage as long as more requests can be queued up.

Some IDE _and/or_ SCSI drives do not cache writes. For these drives,
your data still gets to the platter even in the _absence_ of an OS
flush-cache command.

It sounds like the core problem is that no flush-cache command is being
issued. The drive technology (write cache or not) is largely irrelevant.


> > If fsync(2) returns without a flush-cache, then your data is not
> > guaranteed to be on the disk. And as you noted, flush-cache destroys
> > performance.
>
> It's my understanding that it doesn't. There was some discussion in the past

eh? flush-cache very definitely hurts performance, on both IDE and
SCSI, for drives that support write caching.
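
For reference, here is roughly what the application side looks like. This
is only a sketch: durable_write() is a made-up helper name, not anything
from a real library, and it assumes a POSIX system. Note that a zero
return from fsync(2) only tells you the kernel considers the data stable.
Whether the driver sent a flush-cache to the drive underneath is exactly
the open question in this thread.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes to path, then ask the kernel to
 * push them to stable storage. Returns 0 on success, -1 on failure. */
int durable_write(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal, retry */
            close(fd);
            return -1;
        }
        p += n;                 /* short write: advance and keep going */
        len -= (size_t)n;
    }

    /* Success here means the kernel reports the data as flushed. It does
     * NOT by itself prove the drive's write cache was flushed to media;
     * that depends on the driver issuing FLUSH CACHE / SYNCHRONIZE CACHE. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

A database would call something like this for its log writes and treat
any -1 as "not on disk".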


> > There are three levels:
> >
> > a) Data is successfully transferred to the controller/drive queue (TCQ).
> > b) Data is successfully transferred to the drive's internal buffers.
> > c) The drive successfully transfers data to the media.
>
> Only the third is of interest to Postgres or other databases. In fact, I

Certainly.
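
To put the three levels in code terms, here is a toy sketch. The enum and
function names are invented for illustration and do not correspond to
anything in the kernel or libata:

```c
#include <stdbool.h>

/* How far a write has progressed, following levels (a)-(c) above. */
enum write_level {
    WRITE_QUEUED,       /* (a) in the controller/drive queue (TCQ) */
    WRITE_IN_BUFFER,    /* (b) in the drive's internal buffers */
    WRITE_ON_MEDIA      /* (c) transferred to the media */
};

/* Only level (c) survives power loss, so only it counts as durable. */
bool write_is_durable(enum write_level level)
{
    return level == WRITE_ON_MEDIA;
}
```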


> suspect only the third is of interest to other systems that are supposed to be
> reliable like MTAs etc. I think Wietse and others would be shocked if they
> were told fsync wasn't guaranteed to have waited until the writes had actually
> hit the media.

As well he should be :)

Jeff


