Re: [sqlite] light weight write barriers

From: david
Date: Thu Oct 25 2012 - 02:58:49 EST


On Thu, 25 Oct 2012, Theodore Ts'o wrote:

On Thu, Oct 25, 2012 at 12:18:47AM -0500, Nico Williams wrote:

By trusting fsync(). And if you don't care about immediate Durability
you can run the fsync() in a background thread and mark the associated
transaction as completed in the next transaction to be written after
the fsync() completes.

The challenge is when you have entangled metadata updates. That is,
you update file A, and file B, and file A and B might share metadata.
In order to sync file A, you also have to update part of the metadata
for the updates to file B, which means calculating the dependencies of
what you have to drag in can get very complicated. You can keep track
of what bits of the metadata you have to undo and then redo before
writing out the metadata for fsync(A), but that basically means you
have to implement soft updates, and all of the complexity this
implies: http://lwn.net/Articles/339337/

If you can keep all of the metadata separate, this can be somewhat
mitigated, but usually the block allocation records (regardless of
whether you use a tree, or a bitmap, or some other data structure)
tend to have entanglement problems.

hmm, two thoughts occur to me.

1. to avoid entanglement, put the two files in separate directories

2. take advantage of entanglement to enforce ordering


thread 1 (repeated): write new message to file 1, spawn new thread to fsync

thread 2: write to file 2 that messages 1-5 are being worked on

thread 2 (later): write to file 2 that messages 1-5 are done

When thread 1 spawns the new thread to do the fsync, the filesystem will be forced to write out the data for file 2 as it stood at the time of the fsync.

This should make it so that you never have data in file 2 that refers to data that hasn't yet been written to file 1.


It certainly is not impossible; RDBMS's have implemented this. On the
other hand, they generally aren't as fast as file systems for
non-transactional workloads, and people really care about performance
on those sorts of workloads for file systems.

The RDBMSs have implemented stronger guarantees than what we need here.

A few years ago I was investigating this for logging. With the reliable (RDBMS-style) but inefficient disk queue that rsyslog has, writing to a high-end Fusion-io SSD, ext2 delivered ~8K logs/sec, ext3 ~2K logs/sec, and JFS/XFS ~4K logs/sec (ext4 wasn't considered stable enough at the time to be tested).

Still, if you want to try to implement such a thing, by all means,
give it a try. But I think you'll find that creating a file system
that can compete with existing file systems for performance, and
*then* also supports a transactional model, is going to be quite a
challenge.

The question is how to get ordering right with existing filesystems (preferably without anything too tied to a single filesystem implementation), not how to create a new one.

The frustrating thing is that when people point out how things like sqlite are so horribly slow, the reply seems to be "well, that's what you get for doing so many fsyncs; don't do that", but when there is a 'problem' like the KDE "config loss" issue a few years ago, the response is "well, that's what you get for not doing fsync".

Both responses are correct, from a purely technical point of view.

But what's missing is any way to get the result of ordered I/O: something reasonably fast, with the guarantee that if you lose data in a crash, the only loss you risk is that your most recent data may be missing (either for one file, or across multiple files if that's what it takes).

Since this topic came up again, I figured I'd poke a bit and either get educated on how to do this "right", or see if there's something that could be added to the kernel to make it possible for userspace programs to do this.

What I think userspace really needs is something like a barrier function call. "for this fd, don't re-order writes as they go down through the stack"

If the hardware reorders requests once they reach it, this is going to hurt performance (how much depends on a lot of factors).

But the filesystems are able to make their journals work, so there should be some way to let userspace do some sort of similar ordering.

David Lang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/