Fundamental race condition in wait_event_interruptible_exclusive() ?

From: Linus Torvalds
Date: Sun Dec 08 2019 - 16:13:55 EST


So I'm looking at the thundering herd issue with pipes, and I have a
truly monumentally silly benchmark for it:

#include <unistd.h>

int main(int argc, char **argv)
{
int fd[2], counters[2];

pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));

/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();

do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}

which basically passes an integer token around (with "-1" being
"ignore" and being passed around occasionally as a test for partial
reads).

It passes that counter token around a million times, and every other
time it also writes that "-1" case, so in a perfect world, you'd get
1.5 million scheduling events, because that's how many write() ->
read() pairings there are.

Even in a perfect world there will be a few extra scheduling events
because the whole "finish the pipeline" waits for _everybody_ to see
that one million number, so the counter token gets passed around a bit
more than exactly a million times, but that's in the noise.

In reality, I get many more than that, because the write will wake up
all the readers, even if only one (or two) get any values. You can see
it easily with "perf stat" or whatever.

Whatever, that's not the important part. The benchmark is obviously
entirely artificial and not that interesting, it's just to show a
point.

I do have a tentative patch to fix it - use separate reader and writer
wake queues, and exclusive waits.

But when I was looking at using exclusive waits, I noticed an issue.
It looks like wait_event_interruptible_exclusive() is fundamentally
untrustworthy.

What I'd _like_ to do is basically something like this in pipe_read():

.. read the data, pipe is now empty ..
__pipe_unlock();
.. wake writers if the pipe used to be full ..

// Wait for more data
err = wait_event_interruptible_exclusive(pipe->rd_wait,
pipe_readable());
if (err)
return err;

// Remember that we need to wake up the next reader
// if we don't consume everything
wake_remaining_readers = true;

__pipe_lock();
.. loop back and read the data ..

but it looks like the above pattern is actually buggy (the above
doesn't have the parts at the end where we actually wake the next
reader if we didn't empty the pipe, I'm just describing the waiting
part).

The reason it is buggy is that wait_event_interruptible_exclusive()
does this (inside the __wait_event() macro that it expands to):

long __int = prepare_to_wait_event(&wq_head,
&__wq_entry, state);\

\
if (condition)
\
break;
\

\
if (___wait_is_interruptible(state) && __int) {
\
__ret = __int;
\
goto __out;
\

and the thing is, if does that "__ret = __int" case and returns
-ERESTARTSYS, it's possible that the wakeup event has already been
consumed, because we've added ourselves as an exclusive writer to the
queue. So it _says_ it was interrupted, not woken up, and the wait got
cancelled, but because we were an exclusive waiter, we might be the
_only_ thing that got woken up, and the wakeup basically got forgotten
- all the other exclusive waiters will remain waiting.

Yes, yes, I can make the read_pipe() logic be that I ignore the return
value entirely, and always take the pipe lock, and always loop back,
and then the reading code will check for signal_pending() there, and
do the wake_remaining_readers if there is still data to be read, and
just do this instead:

wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable());

// Remember that we need to wake up the next reader
// if we don't consume everything
wake_remaining_readers = true;

__pipe_lock();
.. loop back and read the data ..

but that's kind of sad. And the basic point is that the return value
from wait_event_interruptible_exclusive() seems to not really be
reliable. You can't really use it - even if it says you got
interrupted, you still have to go back and check the condition and do
the work, and only do interruptability handling after that.

I dunno. Maybe this is fundamental to the interface? We do not have a
lot of users that mix "interruptible" and "exclusive". In fact, I see
only two cases that care about the return value, but at least the fuse
use does seem to have exactly this problem with potentially lost
wakeups because the ERESTARTSYS case ends up eating a wakeup without
doing anything about it.

Looks like the other user - the USB gadget HID thing - also has this
buggy pattern of believing the return value, and losing a wakeup
event.

Adding Miklos and Felipe to the cc just because of the fuse and USB
gadget potential issues, but this is mainly a scheduler maintainer
question.

It's possible that I've misread the wait-event code. PeterZ?

Linus