Re: fsync on large files

Michael Marxmeier (mike@msede.com)
Mon, 15 Feb 1999 21:41:47 +0100


Andi Kleen <ak@muc.de> wrote:
> This is a very important patch, because fsync() is needed in some
> transaction oriented databases (they have to call fsync() or fdatasync()
> on the log frequently to commit operations).

Exactly! Just imagine what the current ext2_sync_file() implementation
does to you with a 1 GB file and a few fsyncdata() per second.
The only viable workaround so far is the patch from Scott Laird
which reverts to fsync_dev() depending on file size.
This patch looks like the right approach.

> For some of them 4 blocks
> could be not enough though, perhaps a more generic version of your
> patch that handles more than 4 blocks could be found - like keeping
> the dirty blocks per inode on a special list.

Depending on database activity and transaction size this could
easily use much more than 4 blocks.

Theodore Y. Ts'o <tytso@MIT.EDU> wrote:
> If you look at my patch, you'll see that number of blocks saved is
> configurable by adjusting the NUM_EXT2_FFSYNC_BLKS #define. While it
> might make sense to increase that number, note that past a certain
> point, you might as well simply go through all of the indirect blocks.

>From my experience, going through the triple indirect blocks is
always calling for trouble. So if it takes too much overhead to keep
a bigger list of dirtied blocks, we might as well fall back to
fsync_dev() for bigger files.

> Adding blocks to the ffsync list is an order N-squared operation, due to
> the need to check to see if the block is already on the list.

We could use a more scalable algorithm, so having more entries would
not make such a big difference.
There is a tradeoff depending on usage: If you are dealing with
rather big files and use fdatasync() it makes a difference if you can
avoid the ext2_sync_file(). If not, you may not like the write()
performance beeing affected.

> We could increase NUM_EXT2_FFSYNC_BLKS, but it would be useful to get
> some actual performance results about how many blocks a typical
> transaction oriented database actually tends to write out before calling
> fsync() or fdatasync().

Define "typical". This depends on application software and
transaction
strategy and could vary for customers and db implementation.
Even with moderate database sizes (1 few 100 MB) having a larger list
of blocks would almost always be a big win.

> Anyone interesting in doing some data gathering?

I plan to run some performance tests during this week. I will add
some code to check for number of blocks dirtied between fsync().
I won't consider that "typical" though. It's just another datapoint.

It's probably also interesting to compare your patch against the
fsync_dev() workaround.

Michael

-- 
Michael Marxmeier           Marxmeier Software GmbH
E-Mail: mike@msede.com      Besenbruchstrasse 9
Voice : +49 202 2431440     42285 Wuppertal, Germany
Fax   : +49 202 2431420     http://www.msede.com/

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/