Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

From: Ilpo Järvinen
Date: Mon May 26 2008 - 12:32:32 EST


On Mon, 26 May 2008, Ingo Molnar wrote:

>
> * Ilpo Järvinen <ilpo.jarvinen@xxxxxxxxxxx> wrote:
>
> > On Mon, 26 May 2008, Ingo Molnar wrote:
> >
> > > there's a hung distcc task on the system, waiting for socket action
> > > forever:
> > >
> > > [root@europe ~]# strace -fp 19578
> > > Process 19578 attached - interrupt to quit
> > > select(5, NULL, [4], [4], {82, 90000} <unfinished ...>
> >
> > Hmm, readfds is NULL isn't it?!? Are you sure you straced the right
> > process?
>
> yes, i'm stracing the task that is hung unexpectedly.

But that wasn't the receiving process? (I couldn't quickly find out in
which direction the distcc connections go, so I couldn't confirm this.)
If you still have that situation at hand, could you check which one is
the receiving process (e.g., using netstat -p; the end with a non-empty
Recv-Q is the right one) and where it's stuck?
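
If netstat isn't handy, roughly the same check can be done by reading
/proc/net/tcp directly. What follows is only a sketch, not something
taken from the report: the function names are made up for illustration,
it handles IPv4 only, and it assumes the usual column layout in which
the fifth field is "tx_queue:rx_queue" in hex. Only ESTABLISHED sockets
are considered, because for listening sockets that column has a
different meaning.

```python
import socket
import struct

TCP_ESTABLISHED = 0x01  # value of the "st" column for established sockets


def decode_addr(hex_addr):
    """Turn 'IIIIIIII:PPPP' from /proc/net/tcp into an (ip, port) tuple."""
    ip_hex, port_hex = hex_addr.split(":")
    # The kernel prints the in-memory 32-bit address as a native integer,
    # so packing it back in native byte order recovers network order.
    ip = socket.inet_ntoa(struct.pack("=I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def parse_tcp_table(text):
    """Parse /proc/net/tcp-style text into a list of dicts (IPv4 only)."""
    rows = []
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 5:
            continue
        tx_hex, rx_hex = fields[4].split(":")
        rows.append({
            "local": decode_addr(fields[1]),
            "remote": decode_addr(fields[2]),
            "state": int(fields[3], 16),
            "tx_queue": int(tx_hex, 16),
            "rx_queue": int(rx_hex, 16),
        })
    return rows


def receivers_with_backlog(rows):
    """Established sockets holding data that the application has not read."""
    return [r for r in rows
            if r["state"] == TCP_ESTABLISHED and r["rx_queue"] > 0]


def stuck_receivers(path="/proc/net/tcp"):
    """Convenience wrapper that reads the live table (Linux only)."""
    with open(path) as f:
        return receivers_with_backlog(parse_tcp_table(f.read()))
```

Calling stuck_receivers() on the affected box should list the socket
whose owner is not reading; mapping that socket to a PID still needs
netstat -p or a walk over /proc/<pid>/fd.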

> > > disturbing that task via strace did not change the state of the
> > > socket - and that's not unexpected as it's a select(). [TCP state
> > > might be affected if strace impacted a recvmsg or a sendmsg wait
> > > directly.]
> >
> > I fail to understand this paragraph due to excessive negation... :-)
>
> i mean, sometimes a TCP connection can get 'unstuck' if you strace a
> task - that is because the TCP related syscall the task sits in gets
> interrupted. But in this case it's select() which doesn't explicitly take
> the socket, doesn't do any tcp_push_pending_frames() processing, etc. -
> it just sits on the socket waitqueue AFAICS. And that's expected.

The problem is not at the sender end at all. It is correct flow-control
behavior to stop the sender until the reading end makes more room
available. Thus tcp_push_pending_frames() couldn't have sent anything.
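
To illustrate the flow-control point, here is a small user-space sketch
over loopback. It is my own illustration under stated assumptions, not a
reproduction of the reported hang: the buffer sizes, timeouts and
function names are arbitrary choices. A sender writes until select()
stops reporting the socket as writable while the peer reads nothing,
and it becomes writable again only after the peer drains its queue.

```python
import select
import socket


def fill_until_stalled(sender, chunk=b"x" * 4096, settle=0.3):
    """Write until the socket stays unwritable for `settle` seconds."""
    sent = 0
    while True:
        try:
            sent += sender.send(chunk)
        except BlockingIOError:
            # Send buffer is full; wait briefly in case in-flight data
            # is still being acknowledged and frees some room.
            _, writable, _ = select.select([], [sender], [], settle)
            if not writable:
                return sent


def demo():
    """Return (bytes_sent, writable_while_unread, writable_after_drain)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Small buffers (arbitrary values) so the stall is reached quickly.
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 16384)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    sender = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sender.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 16384)
    sender.connect(listener.getsockname())
    receiver, _ = listener.accept()
    sender.setblocking(False)

    # Phase 1: the receiver reads nothing, so the sender eventually stalls.
    sent = fill_until_stalled(sender)
    _, before, _ = select.select([], [sender], [], 0.3)

    # Phase 2: the receiver drains everything that was sent.
    receiver.settimeout(5.0)
    received = 0
    while received < sent:
        data = receiver.recv(65536)
        if not data:
            break
        received += len(data)
    _, after, _ = select.select([], [sender], [], 5.0)

    for s in (sender, receiver, listener):
        s.close()
    return sent, bool(before), bool(after)


if __name__ == "__main__":
    print(demo())
```

The stalled phase corresponds to the select() with only writefds set in
the strace output: the sender just waits, and nothing it does on its own
can make progress until the reader frees up room.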

...It may still be that the receiving process is stuck due to the
non-networking changes you have there.

--
i.