Re: Longstanding bug in tty_write/swap/nfs_readpage interaction

Linus Torvalds (torvalds@cs.helsinki.fi)
Fri, 13 Dec 1996 15:17:47 +0200 (EET)


On Fri, 13 Dec 1996, Wolfram Gloger wrote:
>
> Ever since the asynchronous NFS client code was installed (sometime in
> the 1.3.x series), I had modem getty processes (whose executables lie
> on an NFS-mounted system) mysteriously hang in the disk wait state.
> Now, I finally tracked down the problem. write_chan() from the serial
> driver code does:
>
> ...
> add_wait_queue(&tty->write_wait, &wait);
> while (1) {
> current->state = TASK_INTERRUPTIBLE;
> ...
> if(...)
> else
> c = tty->driver.write(tty, 1, b, nr);
> ...
> schedule();
> }
> current->state = TASK_RUNNING;
> remove_wait_queue(&tty->write_wait, &wait);

This is correct.

> ...
>
> But when tty->driver.write() is rs_write(), it does a memcpy_fromfs()
> (or whatever it is now called in 2.1; the problem probably remains the
> same) which may have to swap-in a page via nfs_readpage(), which only
> works when current->state is TASK_RUNNING (the asynchronous NFS code
> calls schedule()).

Hmm.. This looks like a bug in nfs_readpage(), or possibly the memory
management. The code above is correct, and these kinds of "lie and say
we're asleep" cases are needed to handle race conditions. Personally I
think this is a NFS problem - the page-read function shouldn't care about
process state details like TASK_RUNNING because reading a page isn't
really a "process" thing at all.

Note that it's entirely ok to _return_ from the page fault with
TASK_RUNNING - that's why there is a while-loop and the task state set to
TASK_INTERRUPTIBLE. So one option is to just add

current->state = TASK_RUNNING;

to the nfs page_read() function - or even better make the NFS page_read()
function totally unaware of these details.

> I've discussed this with Olaf Kirch, and he thinks that current->state
> should possibly be only set _after_ calling the tty->driver.write()
> handler, so as to maintain the invariant that every task can call
> schedule() without looking at current->state first.

No. That results in potential race conditions - a wake-up event that is
missed because it happens before we have gone to sleep but after we have
done some of the tests. The tty layer is correct, the bug is definitely in
the NFS layer (but it could also be "fixed" by just making the memory
management do the "current->state" stuff).

> But if the serial code isn't the only place where this situation can
> occur, perhaps the NFS code needs to be changed.

The tty code is definitely not the only palce, although the tty code does
happen to be one of the more usual cases where the wait-queues are used
like this. Also, more importantly, the whole idea with wait-queues is that
you can avoid races by "lying" and saying you're asleep a lot earlier than
you actually go to sleep, because that way you can catch wake-up events
over a longer timeframe and thus avoid race conditions.

Linus