Re: [sqlite] light weight write barriers

From: david
Date: Thu Oct 25 2012 - 03:11:42 EST


On Thu, 25 Oct 2012, Theodore Ts'o wrote:

On Wed, Oct 24, 2012 at 03:03:00PM -0700, david@xxxxxxx wrote:
Like what is being described for sqlite, losing the tail end of the
messages is not a big problem under normal conditions. But there is
a need to be sure that what is there is complete up to the point
where it's lost.

This is similar in concept to the write-ahead logs used by databases
(without the absolute durability requirement).

If that's what you require, and you are using ext3/4, using data
journalling might meet your requirements. It's something you can
enable on a per-file basis, via chattr +j; you don't have to force all
file systems to use data journaling via the data=journal mount
option.

The potential downsides that you may or may not care about for this
particular application:

(a) This will definitely have a performance impact, especially if you
are doing lots of small (less than 4k) writes, since the data blocks
will get run through the journal, and will only later get written to
their final location on disk.

(b) You don't get atomicity if the write spans a 4k block boundary.
All of the bytes before i_size will be written, so you don't have to
worry about "holes"; but the last message written to the log file
might be truncated.

(c) There will be a performance impact, since the contents of data
blocks will be written at least twice (once to the journal, and once
to the final location on disk). If you do lots of small, sub-4k
writes, the performance might be even worse, since data blocks might
be written multiple times to the journal.

I'll have to dig into this option. In the case of rsyslog it sounds like it could work (not as good as a filesystem-independent way of doing things, but better than full fsyncs).
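
As a rough sketch of what chattr +j does for a single file, the same attribute can be set from a program through the FS_IOC_GETFLAGS / FS_IOC_SETFLAGS ioctls. The numeric values below are the ones from <linux/fs.h> as encoded on 64-bit x86 Linux, which is an assumption here (some other architectures encode ioctl numbers differently), and the helper names are made up for illustration. Setting this flag normally needs root or CAP_SYS_RESOURCE, so the call is expected to fail for an ordinary user.

```python
import fcntl
import struct

# Assumed values from <linux/fs.h>, as encoded on 64-bit x86 Linux.
FS_IOC_GETFLAGS = 0x80086601     # _IOR('f', 1, long)
FS_IOC_SETFLAGS = 0x40086602     # _IOW('f', 2, long)
FS_JOURNAL_DATA_FL = 0x00004000  # the 'j' attribute shown by lsattr


def with_journal_flag(flags: int) -> int:
    """Return the inode flags with the data-journalling bit set."""
    return flags | FS_JOURNAL_DATA_FL


def enable_data_journal(path: str) -> bool:
    """Try to set the 'j' attribute on path (hypothetical helper).

    Returns True on success and False if the file cannot be opened or
    the filesystem or our privileges do not allow the change.
    """
    try:
        with open(path, "rb") as f:
            # The kernel reads and writes a 32-bit flags word here.
            buf = fcntl.ioctl(f.fileno(), FS_IOC_GETFLAGS, struct.pack("I", 0))
            (flags,) = struct.unpack("I", buf)
            fcntl.ioctl(f.fileno(), FS_IOC_SETFLAGS,
                        struct.pack("I", with_journal_flag(flags)))
        return True
    except OSError:
        return False
```

A logger could call enable_data_journal() once when it creates its queue file and fall back to its usual fsync strategy when the call returns False.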

Truncated messages are not great, but they are a detectable and acceptable risk.
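
To illustrate why a truncated tail is detectable, here is a minimal sketch of one possible record framing: a length plus a CRC32 in front of every message, with a reader that stops at the first record that is short or fails its checksum. This is not rsyslog's actual on-disk format; the layout and the function names are assumptions made for the example.

```python
import struct
import zlib

# Hypothetical record header: payload length, then CRC32 of the payload.
HEADER = struct.Struct("<II")


def frame(payload: bytes) -> bytes:
    """Wrap one message so that a partial write can be recognised later."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def read_complete(log: bytes):
    """Return (messages, valid_bytes) for the intact prefix of the log.

    Reading stops at the first record whose header or payload is cut
    short, or whose checksum does not match.
    """
    messages, pos = [], 0
    while pos + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, pos)
        start = pos + HEADER.size
        end = start + length
        if end > len(log):
            break  # payload was truncated
        payload = log[start:end]
        if zlib.crc32(payload) != crc:
            break  # payload is damaged
        messages.append(payload)
        pos = end
    return messages, pos
```

After a crash, the file could be truncated back to valid_bytes before new messages are appended, so everything before that point is known to be complete.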

While the average message size is much smaller than 4K (on my network it's ~250 bytes), the metadata that's broken out expands this somewhat, and we can afford to waste disk space if it makes things safer or more efficient.

If we do update in place with flags with each message, each message will need to be written up to three times (on receipt, while being processed, and when processing has finished). With high message burst rates, I'm worried that we would fill up the journal. Is there a good way to deal with this?
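
A minimal sketch of the update-in-place pattern, assuming a made-up record layout with a one-byte state flag in front of each message: the message is appended once, and each later state change rewrites only that flag byte. The constants and function names are hypothetical, not taken from rsyslog.

```python
import os
import struct

# Hypothetical states for the three writes per message.
RECEIVED, PROCESSING, DONE = 1, 2, 3

# Hypothetical record header: one state byte, then the payload length.
HEADER = struct.Struct("<BI")


def append_message(fd: int, payload: bytes) -> int:
    """Write 1: append the message in state RECEIVED, return its offset."""
    offset = os.lseek(fd, 0, os.SEEK_END)
    os.write(fd, HEADER.pack(RECEIVED, len(payload)) + payload)
    return offset


def set_state(fd: int, offset: int, state: int) -> None:
    """Writes 2 and 3: overwrite only the state byte of an existing record."""
    os.pwrite(fd, bytes([state]), offset)


def get_state(fd: int, offset: int) -> int:
    """Read back the state byte of the record at offset."""
    return os.pread(fd, 1, offset)[0]
```

Even though set_state() changes a single byte, with data journalling the whole block containing that byte goes through the journal again, which is where the concern about burst rates comes from.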

I believe that ext4 can put the journal on a different device from the filesystem. Would this help a lot?

If you were to put the journal for an ext4 filesystem on a ram disk, you would lose the data recovery protection of the journal, but could you use this trick to get ordered data writes onto the filesystem?

David Lang