Re: sys_write() racy for multi-threaded append?

From: Michael K. Edwards
Date: Fri Mar 09 2007 - 06:53:20 EST


On 3/8/07, Eric Dumazet <dada1@xxxxxxxxxxxxx> wrote:
Dont even try, you *cannot* do that, without breaking the standards, or
without a performance drop.

The only safe way would be to lock the file during the whole read()/write()
syscall, and we dont want this (this would be more expensive than current)
Dont forget 'file' may be some sockets/tty/whatever, not a regular file.

I'm not trying to provide full serialization of concurrent
multi-threaded read()/write() in all exceptional scenarios. I am
trying to think through the semantics of pipelined I/O operations on
struct file. In the _absence_ of an exception, something sane should
happen -- and when you start at f_pos == 1000 and write 100 bytes each
from two threads (successfully), leaving f_pos at 1100 is not sane.

Standards are saying :

If an error occurs, file pointer remains unchanged.

From 1003.1, 2004 edition:

<standard>
This volume of IEEE Std 1003.1-2001 does not specify the value of the
file offset after an error is returned; there are too many cases. For
programming errors, such as [EBADF], the concept is meaningless since
no file is involved. For errors that are detected immediately, such as
[EAGAIN], clearly the pointer should not change. After an interrupt or
hardware error, however, an updated value would be very useful and is
the behavior of many implementations.

This volume of IEEE Std 1003.1-2001 does not specify behavior of
concurrent writes to a file from multiple processes. Applications
should use some form of concurrency control.
</standard>

The effect on f_pos of an error during concurrent writes is therefore
doubly unconstrained. In the absence of concurrent writes, it is
quite harmless for f_pos to have transiently contained, at some point
during the execution of write(), an overestimate of the file position
after the write(). In the presence of concurrent writes (let us say
two 100-byte writes to a file whose f_pos starts at 1000), it is
conceivable that the second write will succeed at f_pos == 1100 but
the first will be short (let us say only 50 bytes are written),
leaving f_pos at 1150 and no bytes written in the range 1050 to 1099.
That would suck -- but the standard does not oblige you to avoid it
unless the destination is a pipe or FIFO with O_NONBLOCK clear, in
which case partial writes are flat out verboten.

You cannot know for sure how many bytes will be written, since write() can
returns a count that is different than buflen.

Of course (except in the pipe/FIFO case). But if it does, and you're
writing concurrently to the fd, you're not obliged to do anything sane
at all. If you're not writing concurrently, the fact that you
overshot and then fixed it up after vfs_write() returned is _totally_
invisible. f_pos is local to the struct file.

So you cannot update fpos before calling vfs_write()

You can speculatively update it, and in the common case you don't have
to touch it again. That's a win.

About your L1 'miss', dont forget that multi-threaded apps are going to
atomic_dec_and_test(&file->f_count) anyway when fput() is done at the end of
syscall. And you were concerned about multi-threaded apps, didnt you ?

That does indeed interfere with the optimization for multi-threaded
apps. Which doesn't mean it isn't worth having for single-threaded
apps. And if we get to the point that that atomic_dec_and_test is the
only thing left (in the common case) that touches the struct file
after a VFS operation completes, then we can evaluate whether f_count
ought to be split out of the struct file and put somewhere else. In
fact, if I understand the calls inside vfs_write() correctly, f_count
is (usually?) the only member of struct file that is written to during
a call to sys_pwrite64(); so moving it out of struct file and into
some place where it has to be kept cache-coherent anyway would also
improve the performance on SMP of distributed pwrite() calls to a
process-global fd.

Cheers,
- Michael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/