Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

From: Patrick McManus
Date: Sat May 31 2008 - 19:20:54 EST


On Sat, 2008-05-31 at 18:35 +0200, Ingo Molnar wrote:
> * Ilpo Järvinen <ilpo.jarvinen@xxxxxxxxxxx> wrote:
>

> > ...setsockopt(listenfd, SOL_TCP, TCP_DEFER_ACCEPT, &val, sizeof(val))
> > seems to be the magic trick that is interesting here.
>
> seems to be used:
>
> 22003 write(3, "distccd[22003] (dcc_listen_by_ad"..., 62) = 62
> 22003 listen(4, 10) = 0
> 22003 setsockopt(4, SOL_TCP, TCP_DEFER_ACCEPT, [1], 4) = 0
>
> i'll queue up your reverts for testing in -tip.


So the code you will revert came from my fingers. The circumstances here
make me nervous; while I'm at a loss to explain what might be going on
in particular, let me offer an apology in advance should the revert help
resolve the issue.

Here's what makes me nervous:

* not a lot of code uses DEFER_ACCEPT. Frankly, it was pretty broken
before 26, but not broken this way. The correlation between your bug and
its use is significant.

* in 26, a server TCP socket (with DA) goes to ESTABLISHED when the 3rd
part of the handshake is received (as normal without DA), but the socket
isn't put on the accept queue until a real data packet arrives. (That's
the point of DA.) In <= 25 this socket would have stayed in SYN_RECV
until the data packet arrived.

* I did run tests where the server died between the handshake completing
and the first data packet arriving: the client should see a RST and the
server socket should disappear. But maybe something was missed?
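
For anyone who wants to poke at this, the listener setup from the strace
output above can be approximated with a minimal sketch like the one
below. The helper name listen_deferred() and its parameters are made up
for illustration; it is not taken from distccd. It uses IPPROTO_TCP,
which has the same value as SOL_TCP on Linux, and a backlog of 10 as in
the trace.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from distccd): open a loopback listener and
 * enable TCP_DEFER_ACCEPT on it. Pass port 0 to let the kernel pick a
 * port. Returns the listening fd, or -1 on any failure.
 */
int listen_deferred(unsigned short port, int defer_secs)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    /* Same order as in the trace: listen() first, then setsockopt(). */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 10) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_DEFER_ACCEPT,
                   &defer_secs, sizeof(defer_secs)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

With DA enabled, accept() on the returned fd should not return for a
client that has only completed the handshake; it should return once the
client sends its first data bytes (or the defer period runs out).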

Do I understand this correctly: the server process is gone but the
socket is still in the table? And the client process is still there,
waiting for the server to do something, having sent a bunch of data?

Do we know if any data bytes (not handshake bytes) have been consumed by
the server side? If they were, that would seem to vindicate DA.

Also pointing away from DA is that you started seeing this with rc3 -
that code was included in rc1. Is that a firm observation, or maybe there
weren't enough datapoints to conclude that rc1 and rc2 were clean?

The most interesting patch is ec3c0982a2dd1e671bad8e9d26c28dcba0039d87
if anyone wants to eyeball it.




