On Sat, 11 Feb 2006, Nick Piggin wrote:
No, you are thinking about what the kernel does. Subtle difference. A
smart user wants to:
- start writing this
- start writing that
- start writing that-other-thing
- make sure this that and the other have reached backing store
OK so in effect it is the same thing, but it is better to export the
interface that reflects how the user interacts with pagecache.
WRITE_SYNC obviously does the "wait for them all" (aka ensure they
hit backing store) thing too, right? It performs exactly the same
role that WRITE_WAIT would do in the above example.
NOOOOOO!
Think about it for a second. Think about the usage case you yourself were quoting.
The "magic" in IO is "overlapping IO". If you don't get overlapping IO, your interfaces are broken. End of story.
And WRITE_SYNC _cannot_ do overlapping IO.
It's entirely possible that somebody else (or that very same program) has dirtied the same pages that you started write-out on earlier. And that is when "wait for writes to finish" and "WRITE_SYNC" _differ_.
If you want synchronous writes, use synchronous writes. But if you want asynchronous writes, you do _not_ implement them as "start writes now" and "write synchronously". You implement them as "start writes now" and "wait for the writes to have finished".
There's another very specific and important difference: "wait for the writes" is fundamentally an interruptible and pollable operation, which means that it's a lot easier to integrate into any system that has to do other things too. In contrast, WRITE_SYNC is _neither_ easily interruptible nor pollable.
So WRITE_SYNC has clearly different behaviour. There's a good reason the kernel internally has "start write" + "wait for write", and I'll repeat: none of those reasons go away just because you move to user space.
My proposal isn't really different to Andrew's in terms of functionality
(unless I've missed something), but it is more consistent because it
does not introduce this completely new concept to our userspace API but
rather uses the SYNC/ASYNC distinction like everything else.
Your proposal has two _huge_ downsides:
- it changes semantics, and you have absolutely _no_ idea of who depends on the performance semantics of the old behaviour. In contrast, I can tell you that we did it once before, and we reverted it.
- it's not at all consistent. The _current_ behaviour is consistent, and matches 100% the current behaviour of sync vs async write().