Re: PATCH: Fix nbd on 2.1.118 SMP kernels

Stephen C. Tweedie (sct@redhat.com)
Thu, 27 Aug 1998 22:00:33 +0100


Hi,

On Thu, 27 Aug 1998 11:40:43 -0700 (PDT), Linus Torvalds
<torvalds@transmeta.com> said:

> It's otherwise ok, but this part is certainly buggy:

>> ...

> If you drop the io-request lock, and then not expect CURRENT to change,
> you're just fooling yourself. That's like saying "oh, btw, I now let
> people change the CURRENT queue, but by God if anybody actually does so
> I'll panic". That's just too stupid for me to accept as a patch.

No --- that's the whole point. The nbd driver always processes its
requests synchronously. Once a process starts to deal with CURRENT,
no other process will change the head request, and CURRENT is
guaranteed to remain there until we synchronously remove it. We know
for sure that nobody will change it while we block.

There _is_ a problem remaining, but that's not it.

> As far as I can tell, the code will unlink the request immediately
> afterwards anyway, so the fix is to just unlink it _before_ dropping the
> IO request lock, so that we don't have to worry about the request lists
> any more.

I thought so too at first, but in fact that just breaks things. CURRENT
is only ever modified by the do_nbd_request strategy function, and that
is only ever called when restarting an empty queue. Keeping CURRENT in
place while we do the (possibly blocking) send ensures that the strategy
routine is never reentered, so we know we are the only process sending
stuff.

The driver relies on this synchronised nature of CURRENT to prevent race
conditions. If we remove the CURRENT request from the queue before
blocking, then we can get two processes sending data down the socket at
the same time, and the stream gets corrupt. (Been there, tried that,
got blown up...)

This is certainly not the normal way for a block device driver to work,
but in this case assuming CURRENT stays put does work. The only way we
can possibly fall foul with the patched nbd is if one process sends an
nbd request and the nbd receive thread (which lives as a separate
process) gets woken up with a reply before the sending process gets
rescheduled. That case is detected as a bug by nbd_do_it. I could fix
it, but that was outside the scope of the patch I sent you. It would
require introduction of a new locking mechanism to ensure serialisation
of the request queue (nbd cannot deal with out-of-order request
replies).

I'll certainly look at addressing that other bug, but it doesn't
invalidate the patch I sent you.

--Stephen

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.altern.org/andrebalsa/doc/lkml-faq.html