Re: [PATCH] Write the inode itself in block_fsync()

From: Bart Samwel
Date: Fri Mar 10 2006 - 09:10:47 EST


Andrew Morton wrote:
Sam Vilain <sam@xxxxxxxxxx> wrote:
OGAWA Hirofumi wrote:
>>
>For block device's inode, we don't write a inode's meta data
>itself. But, I think we should write inode's meta data for fsync().

Ouch... won't that halve performance of database transaction logs?

Yes, it could well cause a lot more seeking to do atime and/or mtime
writes. Which aren't terribly important, really.

Unless I'm missing something, I suspect we'd be better off without this,
even though it's a correctness fix :(

Maybe atime/mtime aren't important, but I would be unhappy if a file size change wasn't written to disk on fsync.

Anyway, shouldn't databases be using a combination of fixed-size files and fdatasync? fsync doesn't perform well by definition, and I guess the only reason databases still use it is because the kernel failed to implement the sucky part of the behaviour.

A complex but perhaps viable suggestion: as atime/mtime are stored on disk in second granularity (on ext3 at least, don't know about other fss), wouldn't it somehow be possible to only regard atime/mtime changes as real changes when i_(a|c)time.tv_sec changes? This would enable fsync to write the inode once every second instead of on every fsync. The performance drop would be much less dramatic than writing the inode on every fsync, and it would at least yield correct behaviour.

Cheers,
Bart
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/