Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?

From: Andrew Morton
Date: Sun Jun 01 2008 - 16:38:22 EST


On Sun, 1 Jun 2008 13:40:09 +0200 Pavel Machek <pavel@xxxxxxx> wrote:

> Hi!
>
> > > > > All I can say so far is that I find the same as you do:
> > > > > SYNC_FILE_RANGE_WRITE (after writing) takes a significant amount of time,
> > > > > more than half as long as when you add in SYNC_FILE_RANGE_WAIT_AFTER too.
> > > > >
> > > > > Which makes the sync_file_range call pretty pointless: your usage seems
> > > > > perfectly reasonable to me, but somehow we've broken its behaviour.
> > > > > I'll be investigating ...
> > > >
> > > > It will block on disk queue fullness - sysrq-W will tell.
> > >
> > > Ah, thank you. What a disappointment, though it's understandable.
> > > Doesn't that very severely limit the usefulness of the system call?
> >
> > A bit. The request queue size is runtime tunable though.
>
> Which /sys is that?

/sys/block/sda/queue/nr_requests
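
(A minimal sketch, not from the original thread: reading the current queue
depth from that sysfs file in C. The helper name read_queue_depth() is
hypothetical, and the device name sda is just the example used above.)

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: read one non-negative integer from a sysfs-style
 * text file such as /sys/block/sda/queue/nr_requests.
 * Returns the value, or -1 if the file cannot be opened or parsed.
 */
long read_queue_depth(const char *path)
{
    char buf[32];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    close(fd);
    if (n <= 0)
        return -1;
    buf[n] = '\0';

    char *end;
    errno = 0;
    long val = strtol(buf, &end, 10);
    if (errno != 0 || end == buf || val < 0)
        return -1;
    return val;
}
```

Typical use would be read_queue_depth("/sys/block/sda/queue/nr_requests").
Changing the value means writing a new number to the same file, which
normally requires root.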

> What happens if I set the queue size to pretty
> much infinity, will memory management die horribly?

In theory, no - it's always caused problems when the VM/VFS/FS layer
has relied upon request-queue exhaustion for throttling. Hence all
that code is supposed to work OK when there is no request-queue
blocking. Of course, (theory/practice != 1.0).

> > I expect major users of this system call will be applications which do
> > small-sized overwrites into large files, mainly databases. That is,
> > once the application developers discover its existence. I'm still
> > getting expressions of wonder from people who I tell about the
> > five-year-old fadvise().
>
> Hey, you have one user now, it's called s2disk. But for this call to be
> useful, we'd need an asynchronous variant... is there such a thing?

Well if you're asking the syscall to shove more data into the block
layer than it can concurrently handle, sure, the block layer will
block. It's tunable...

It can still block in places, of course - we might need to do
synchronous reads to get at metadata and we'll need to allocate memory.
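
(A minimal sketch of the usage being discussed, not code from the thread:
overwrite a range, then ask the kernel to start writeback on it without
waiting for completion. The helper name overwrite_and_kick() is
hypothetical. As noted above, the sync_file_range() call can still block if
the request queue is full.)

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes at offset off, then initiate
 * writeback of exactly that range. Returns 0 on success, -1 on error
 * (errno is set by the failing call).
 */
int overwrite_and_kick(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t n = pwrite(fd, p + done, len - done, off + (off_t)done);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing, retry */
            return -1;
        }
        if (n == 0)
            return -1;          /* no progress, give up */
        done += (size_t)n;
    }

    /*
     * Start writeback of the dirty pages in the range. This does not
     * wait for the I/O to complete, but it may block while submitting
     * if the block layer's request queue is full.
     */
    return sync_file_range(fd, off, (off_t)len, SYNC_FILE_RANGE_WRITE);
}
```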

> Okay, I can fork and do the call from another process, but...

I sense a strangeness. What are you actually trying to do with all of this?

Bear in mind that sync_file_range() doesn't sync metadata (i.e. indirect
blocks). So if those blocks weren't already known to have been written, the
data isn't safe.
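
(A minimal sketch of one way to deal with that, offered as an assumption
rather than anything proposed in this thread: use SYNC_FILE_RANGE_WRITE
only as an early-writeback hint while streaming data out, and finish with
fdatasync(), which also flushes the metadata needed to read the data back.
The helper name stream_then_fdatasync() is hypothetical.)

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes from data to fd, starting at
 * offset 0, in pieces of at most chunk bytes. Writeback is started after
 * each piece; durability comes only from the final fdatasync().
 * Returns 0 on success, -1 on error.
 */
int stream_then_fdatasync(int fd, const char *data, size_t len, size_t chunk)
{
    off_t off = 0;

    if (chunk == 0)
        return -1;

    while (len > 0) {
        size_t want = len < chunk ? len : chunk;
        ssize_t n = pwrite(fd, data, want, off);
        if (n <= 0)
            return -1;          /* sketch: no EINTR retry here */

        /* Best-effort hint; a failure here is ignored on purpose. */
        (void)sync_file_range(fd, off, n, SYNC_FILE_RANGE_WRITE);

        data += n;
        off += n;
        len -= (size_t)n;
    }

    /* Flushes the data plus any metadata required to retrieve it. */
    return fdatasync(fd);
}
```

With this shape the sync_file_range() calls only spread the I/O out over
time. Whether the data is safe is decided by the return value of
fdatasync().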

> - * range which are not presently under writeback.
> + * range which are not presently under writeback. Notice that even this this
> + * may and will block if you attempt to write more than request queue size.

um, OK. I'll fix the grammar a bit there.