Re: [PATCH] Write the inode itself in block_fsync()

From: Bart Samwel
Date: Fri Mar 10 2006 - 12:30:40 EST


OGAWA Hirofumi wrote:
Bart Samwel <bart@xxxxxxxxx> writes:

Andrew Morton wrote:
Sam Vilain <sam@xxxxxxxxxx> wrote:
OGAWA Hirofumi wrote:
Ouch... won't that halve performance of database transaction logs?
Yes, it could well cause a lot more seeking to do atime and/or mtime
writes. Which aren't terribly important, really.

Unless I'm missing something, I suspect we'd be better off without this,
even though it's a correctness fix :(
Maybe atime/mtime aren't important, but I would be unhappy if a file size change wasn't written to disk on fsync.

Please don't worry, we should already be doing the right thing for normal
files. This patch is only for block device files.

Ahhh, I missed that. I interpreted:

>For block device's inode, we don't write a inode's meta data
>itself. But, I think we should write inode's meta data for fsync().

as "for block devices we don't, for normal files, yes", but apparently that's not what you meant. :-)

Anyway, shouldn't databases be using a combination of fixed-size files and fdatasync? fsync doesn't perform well by definition, and I guess the only reason databases still use it is because the kernel failed to implement the sucky part of the behaviour.
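
To make the "fixed-size files plus fdatasync" idea concrete, here is a minimal sketch. It assumes the log is preallocated once and then only overwritten in place, so later writes never change the file size and fdatasync() has no size update to push out. The helper names log_open_fixed() and log_write_record() are hypothetical, and using posix_fallocate() for the preallocation is my own choice, not something any of the databases discussed here is known to do.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create the log and grow it to its final size once. */
int log_open_fixed(const char *path, off_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* posix_fallocate() returns an error number rather than setting errno.
     * The one-time fsync() makes the new size itself durable. */
    if (posix_fallocate(fd, 0, size) != 0 || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: overwrite a record in place, then flush data only. */
int log_write_record(int fd, off_t off, const void *buf, size_t len)
{
    ssize_t n = pwrite(fd, buf, len, off);
    if (n < 0 || (size_t)n != len)
        return -1;

    /* The file size is unchanged, so only the data blocks (and any
     * metadata needed to read them back) have to reach the disk. */
    return fdatasync(fd);
}
```

The only full fsync() happens at creation time; every later commit pays for fdatasync() alone, provided the record stays inside the preallocated range.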

Yes, I agree. Changes to atime/mtime only set I_DIRTY_SYNC, so
this patch usually doesn't change fdatasync() at all.
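
The I_DIRTY_SYNC point can be illustrated with a small user-space model. This is only a sketch of the decision, not the patch or any kernel code: the flag names mirror the kernel's i_state bits, but the numeric values and the helper must_write_inode() are assumptions made for the example.

```c
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's inode dirty bits. */
enum {
    I_DIRTY_SYNC     = 1 << 0, /* inode is dirty, e.g. timestamps only */
    I_DIRTY_DATASYNC = 1 << 1, /* metadata needed to read the data back */
};

/* Hypothetical helper: should a sync call also write the inode? */
bool must_write_inode(unsigned int i_state, bool datasync)
{
    if (!(i_state & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)))
        return false;   /* inode is clean, nothing to write */
    if (datasync && !(i_state & I_DIRTY_DATASYNC))
        return false;   /* fdatasync(): timestamp-only changes can wait */
    return true;        /* fsync(), or data-relevant metadata is dirty */
}
```

In this model an atime/mtime update sets only I_DIRTY_SYNC, so the fdatasync() path skips the inode write while the fsync() path performs it.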

Umm... however, I can also understand what akpm says... so I checked some databases.

berkeley db 4.4: use fdatasync() if available
mysql 5.0:       use fdatasync() if available (innobase)
                 use fsync() (bdb)
postgresql:      use fdatasync() if available
sqlite:          use fsync()

Nice piece of info. Apparently all of the "large" database engines can use fdatasync; only the smaller ones (bdb, sqlite) don't. I've done some extra research:

* From a quick look at the docs it seems to me that bdb can't be configured to put its transaction log directly on a block device, so bdb won't be affected.

* SQLite definitely can't write logs to a block device: the docs explicitly say that the transaction log is a regular file with a specific name, so we can write off sqlite as well. (It does seem to use fdatasync btw, since version 3.2.6, see http://www.sqlite.org/changes.html.)

If we haven't missed any, that leaves only proprietary databases at risk. But I would be genuinely surprised if a database like Oracle used fsync. If we assume that Oracle et al. are not a problem, the risks of this patch are very low.

Cheers,
Bart