Re: sys_write() racy for multi-threaded append?

From: Michael K. Edwards
Date: Sat Mar 10 2007 - 01:44:17 EST


I apologize for throwing around words like "stupid". Whether or not
the current semantics can be improved, that's not a constructive way
to characterize them. I'm sorry.

As three people have ably pointed out :-), the particular case of a
pipe/FIFO isn't seekable and doesn't need the f_pos member anyway
(it's effectively always O_APPEND). That's what I get for checking
against standards documents at 3AM. Of course, this has no bearing on
the point that led me to bring up pipes/FIFOs in the first place, which
was that there exist file types that never return a short write
(0 < ret < nbytes). And it was in
the context of a very explicit aside that f_pos is not _interesting_
on a pipe/FIFO, except as an indicator of total bytes written. You
could only peek at this with an (admittedly non-portable) llseek(fd,
0, SEEK_CUR) anyway -- which you would only do for diagnostic
purposes. But diagnosis of odd corner cases (rarely in my code,
usually in other people's) is what I do day in and day out, so for me
it would be worth having.
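
A minimal sketch of the diagnostic probe mentioned above, assuming a
POSIX system. The helper name pipe_lseek_errno() is hypothetical. On
Linux, lseek() on a pipe fails with ESPIPE, so no byte count comes back;
a kernel that chose to track f_pos on pipes would return an offset here
instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: probe the file offset of a fresh pipe.
 * Returns 0 if lseek() succeeds, the errno value if it fails,
 * or -1 if the pipe itself could not be created. */
int pipe_lseek_errno(void)
{
    int fds[2];

    if (pipe(fds) != 0)
        return -1;

    errno = 0;
    /* The "peek" from the text: ask for the current offset of the write end. */
    off_t pos = lseek(fds[1], 0, SEEK_CUR);
    int err = (pos == (off_t)-1) ? errno : 0;

    close(fds[0]);
    close(fds[1]);
    return err;
}
```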

In any case, you're all right that the standard doesn't require you to
do anything useful with f_pos on a pipe/FIFO. But you're permitted to
make it useful if you want to:

<1003.1 lseek()>
The behavior of lseek() on devices which are incapable of seeking is
implementation-defined. The value of the file offset associated with
such a device is undefined.
</1003.1>

Tracking f_pos accurately when writes from multiple threads hit the
same fd (pipe or not) isn't portable, but I recall situations where it
would have been useful. And if f_pos has to be kept at all in the
uncontended case, it costs you little or nothing to do it in a
thread-safe manner -- as long as you don't overconstrain the semantics
such that you forbid the transient overshoot associated with a short
write. In fact, unless there's something I've missed, increasing
f_pos before entering vfs_write() happens to be _faster_ than the
current code for common load patterns, both single- and multi-threaded
(although getting the full benefit in the multi-threaded case will
take some fiddling with f_count placement).
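
As a rough user-space analogy only (this is not the kernel code, and
struct shared_pos and reserve_and_write() are names invented for
illustration), the "advance the position first, then write" idea might
look like the sketch below. The offset is reserved atomically before the
write, and a short or failed write hands back the unused part, which is
where the transient overshoot comes from.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdatomic.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the f_pos member of a shared open file. */
struct shared_pos {
    _Atomic int64_t pos;
};

/* Reserve [start, start + len) up front, write there, then return any
 * shortfall. Between the reservation and the correction, other threads
 * can observe a position larger than what has actually been written. */
ssize_t reserve_and_write(struct shared_pos *sp, int fd,
                          const void *buf, size_t len)
{
    int64_t start = atomic_fetch_add(&sp->pos, (int64_t)len);
    ssize_t ret = pwrite(fd, buf, len, (off_t)start);
    int64_t written = ret > 0 ? (int64_t)ret : 0;

    if (written < (int64_t)len)
        atomic_fetch_sub(&sp->pos, (int64_t)len - written);

    return ret;
}
```

Note that after a short write the region handed back may already lie
below an offset reserved by another thread, so this sketch keeps the
counter consistent but does not guarantee gap-free file contents.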

I say it costs "little or nothing" only because altering an loff_t
atomically is not free. But even on 32-bit x86, which cannot atomically
modify a 64-bit entity in memory with an ordinary load or store, an
uncontended spinlock on a cacheline already in L1 is so cheap that
making the f_pos changes atomic will (I think) be lost in the noise.
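
To make the spinlock variant concrete, here is a small user-space sketch
using a C11 atomic_flag as the lock. It is an illustration under stated
assumptions, not the kernel's spinlock implementation, and the names
struct locked_pos, locked_pos_init() and locked_pos_advance() are
hypothetical. In the uncontended case the cost is one test-and-set and
one clear around a plain 64-bit add.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 64-bit position guarded by a tiny spinlock. */
struct locked_pos {
    atomic_flag lock;
    int64_t pos;
};

void locked_pos_init(struct locked_pos *lp)
{
    atomic_flag_clear(&lp->lock);
    lp->pos = 0;
}

/* Add delta to the position under the lock and return the old value. */
int64_t locked_pos_advance(struct locked_pos *lp, int64_t delta)
{
    /* Spin until the flag was previously clear, i.e. we took the lock. */
    while (atomic_flag_test_and_set_explicit(&lp->lock, memory_order_acquire))
        ;

    int64_t old = lp->pos;
    lp->pos = old + delta;

    atomic_flag_clear_explicit(&lp->lock, memory_order_release);
    return old;
}
```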

In any case, rewriting read_write.c is proving interesting. I'll let
you all know if anything comes of it. In the meantime, thanks for
your (really quite friendly under the circumstances) comments.

Cheers,
- Michael