Re: statfs() / statvfs() syscall ballsup...

From: Linus Torvalds
Date: Fri Oct 10 2003 - 11:04:59 EST



On Fri, 10 Oct 2003, Joel Becker wrote:
> > I hope disk-based databases die off quickly.
>
> As opposed to what? Not a challenge, just interested in what
> you think they should be.

I'm hoping in-memory databases will just kill off the current crop
totally. That solves all the IO problems - the only thing that goes to
disk is the log and the backups, and both go there totally linearly unless
the designer was crazy.

Yeah, I don't follow the db market, but it's just insane to try to keep
the on-disk data in any other format if you've got enough memory. Recovery
may take a long time (reading that whole backup into memory and redoing
the log will be pretty expensive), but replication should handle that
trivially.

> Where I work doesn't change the need for O_DIRECT. If your Big
> App has it's own cache, why copy the cache in the kernel?

Why indeed?

But why do you think you need O_DIRECT with very bad semantics to handle
this?

The kernel page cache does multiple things:
- staging area for letting the filesystem do blocking (ie this is why a
regular "write()" or "read()" doesn't need to care about alignment etc)
- a synchronization entity - making sure that a write and a read cannot
pass each other, and that mmap contents are always _coherent_.
- a cache

O_DIRECT throws the cache part away, but it throws out the baby with the
bathwater, and breaks the other parts. Which is why O_DIRECT breaks things
like disk scheduling in really subtle ways - think about writing and
reading to the same area on the disk, and re-ordering at all different
levels.

And the thing is, uncaching is _trivial_. It's not like it is hard to say
"try to get rid of these pages if they aren't mapped anywhere" and "insert
this user page directly into the page cache". But people are so fixated
with "direct to disk" that they don't even think about it.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/