Re: O_DIRECT question

From: Trond Myklebust
Date: Thu Jan 11 2007 - 14:50:09 EST


On Thu, 2007-01-11 at 11:00 -0800, Linus Torvalds wrote:
>
> On Thu, 11 Jan 2007, Trond Myklebust wrote:
> >
> > For NFS, the main feature of interest when it comes to O_DIRECT is
> > strictly uncached I/O. Replacing it with POSIX_FADV_NOREUSE won't help
> > because it can't guarantee that the page will be thrown out of the page
> > cache before some second process tries to read it. That is particularly
> > true if some dopey third party process has mmapped the file.
>
> You'd still be MUCH better off using the page cache, and just forcing the
> IO (but _with_ all the page cache synchronization still active). Which is
> trivial to do on the filesystem level, especially for something like NFS.
>
> If you bypass the page cache, you just make that "dopey third party
> process" problem worse. You now _guarantee_ that there are aliases with
> different data.

Quite, but that is sometimes an admissible state of affairs.

One of the things that was infuriating when we were trying to do shared
databases over the page cache was that someone would start some
unsynchronised process that had nothing to do with the database itself
(it would typically be a process that was backing up the rest of the
disk or something like that). Said process would end up pinning pages in
memory, and prevented the database itself from getting updated data from
the server.

IOW: the problem was not that of unsynchronised I/O per se. It was
rather that of allowing the application to set up its own
synchronisation barriers and to ensure that no pages are cached across
these barriers. POSIX_FADV_NOREUSE can't offer that guarantee.

> Of course, with NFS, the _server_ will resolve any aliases anyway, so at
> least you don't get file corruption, but you can get some really strange
> things (like the write of one process actually happening before, but being
> flushed _after_ and overriding the later write of the O_DIRECT process).

Writes are not the real problem here since shared databases typically do
implement sufficient synchronisation, and NFS can guarantee that only
the dirty data will be written out. However reading back the data is
problematic when you have insufficient control over the page cache.

The other issue is, of course, that databases don't _want_ to cache the
data in this situation, so the extra copy to the page cache is just a
bother. As you pointed out, that becomes less of an issue as processor
caches and memory speeds increase, but it is still apparently a
measurable effect.

Cheers
Trond

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/