Re: [RFC] writev() semantics with invalid iovec in the middle

From: Al Viro
Date: Thu Sep 15 2016 - 18:29:57 EST


On Thu, Sep 15, 2016 at 06:23:24AM -0400, Mike Marshall wrote:
> If you squeeze out every byte won't you still have a short
> write? And the written data wouldn't be cut at the bad
> place, but it would have a weird hole or discontinuity there.

???

What I mean is that if we have an invalid address in the middle of a buffer
(unmapped, for example), we do not attempt to write every byte prior to that
invalid address. Of course what we write is going to be contiguous.

Suppose we have a buffer spanning 10 pages (amd64, so these are 4K ones) -
7 valid, 3 invalid:
VVVVIIIVVV
and it starts 100 bytes into the first page. And write goes into a regular
file on e.g. tmpfs, starting at offset 31. We _can't_ write more than
4*4096-100 bytes, no matter what. It will be a short write. As a matter
of fact, it will be even shorter than that - it will be 3*4096-31 bytes,
up to the last pagecache boundary we can cover completely. That obviously
depends upon the filesystem - not everything uses pagecache, for starters.
However, the caller is *not* guaranteed that write() with an invalid page
in the middle of a buffer would write everything up to the very beginning
of the invalid page. A short write will happen, but the amount written
might be up to a page size less than the actual length of the valid part
at the beginning of the buffer.
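
To make the arithmetic above concrete, here is a rough sketch. It assumes a
pagecache-backed filesystem that stops at the last pagecache page boundary
the valid prefix reaches; the helper names are made up for illustration and
are not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of valid user memory before the first invalid page. */
static size_t valid_prefix(size_t page_size, size_t buf_off, size_t valid_pages)
{
	assert(page_size > 0 && buf_off < page_size && valid_pages > 0);
	return valid_pages * page_size - buf_off;
}

/*
 * Assumed model: the write ends at the last pagecache page boundary that
 * the valid prefix reaches.  If it reaches none, this model reports 0;
 * what a real filesystem does in that case is not covered here.
 */
static size_t pagecache_short_write(size_t page_size, size_t buf_off,
				    size_t valid_pages, size_t file_off)
{
	size_t end = file_off + valid_prefix(page_size, buf_off, valid_pages);
	size_t boundary = end - end % page_size;	/* round down */

	return boundary > file_off ? boundary - file_off : 0;
}
```

With the numbers above, valid_prefix(4096, 100, 4) is 16284 (4*4096-100) and
pagecache_short_write(4096, 100, 4, 31) is 12257 (3*4096-31), i.e. 4027 bytes
short of the valid prefix, which is less than one page.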

Now, for writev() we could have invalid pages in any iovec; again, we
obviously can't write anything past the first invalid page - we'll get
either a short write or -EFAULT (if nothing got written). That's fine;
the question is what the caller can count upon wrt shortening.

Again, we are *not* guaranteed writing up to exact boundary. However, the
current implementation will end up shortening no more than to the iovec
boundary. I.e. if the first iovec contains only valid pages and there's
an invalid one in the second iovec, the current implementation will write
at least everything in the first iovec. That's _not_ promised by POSIX
or our manpages; moreover, I'm not sure if it's even true for each filesystem.
And keeping that property is actually inconvenient - if we could discard it,
we could make partial-copy ->write_end() calls a lot more infrequent.

Unfortunately, some of the LTP writev tests end up checking that writev() does
behave that way - they feed it a three-element iovec with shorter-than-page
segments, the second of which is all invalid. And they check that the
entire first segment had been written.
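
For reference, the kind of call those tests make looks roughly like the
sketch below. This is not the LTP code; the helper name, the temporary file
path and the use of a PROT_NONE mapping as the "invalid" segment are all
assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: writev() three seg_len-byte segments to a fresh
 * temporary file, the middle one pointing at an inaccessible page.
 * Returns the byte count reported by writev(), or -errno on failure.
 */
static ssize_t writev_bad_middle(size_t seg_len)
{
	char path[] = "/tmp/writev-demo-XXXXXX";
	long page = sysconf(_SC_PAGESIZE);
	struct iovec iov[3];
	ssize_t ret;
	char *good;
	void *bad;
	int fd;

	/* Shorter-than-page segments, as in the tests described above. */
	assert(page > 0 && seg_len > 0 && seg_len < (size_t)page);

	good = malloc(seg_len);
	if (!good)
		return -ENOMEM;
	memset(good, 'A', seg_len);

	/* PROT_NONE page: any attempt to copy from it faults. */
	bad = mmap(NULL, (size_t)page, PROT_NONE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (bad == MAP_FAILED) {
		ret = -errno;
		goto out_free;
	}

	fd = mkstemp(path);
	if (fd < 0) {
		ret = -errno;
		goto out_unmap;
	}
	unlink(path);

	iov[0].iov_base = good;	iov[0].iov_len = seg_len;
	iov[1].iov_base = bad;	iov[1].iov_len = seg_len;
	iov[2].iov_base = good;	iov[2].iov_len = seg_len;

	ret = writev(fd, iov, 3);
	if (ret < 0)
		ret = -errno;

	close(fd);
out_unmap:
	munmap(bad, (size_t)page);
out_free:
	free(good);
	return ret;
}
```

A test that insists on writev_bad_middle(64) == 64 is checking the
iovec-boundary property. A caller that only relies on what is promised
should accept anything from 0 to 64, or -EFAULT if nothing got written.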

I would really like to drop that property and make the rule "if some
addresses in the buffer(s) we are asked to write are invalid, the write will
be shortened by up to a PAGE_SIZE from the first such invalid address". That
would make the writev() rules exactly the same as the write() ones. Does
anybody have objections to that?