Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

From: Håkon Løvdal
Date: Sat May 31 2008 - 10:26:01 EST


2008/5/28 Peter Zijlstra <peterz@xxxxxxxxxxxxx>:
> Just a quick note to say, me too!!
>
> same scenario: distcc on localhost.

Me too, however with a completely different scenario; my hung connections
are not related to distcc at all. The output from /proc/net/tcp that Ingo
posted a few days ago is somewhat different from mine, however I believe
this is the same problem or at least a related one. Just as Ingo experienced,
netstat -p only shows PID/program as '-' for the hung connections while
for other connections it shows the expected results.

I have recently bought a new PC and have started the process of copying
stuff from my old PC to the new PC. During this I have experienced this
hang several times. I started copying by using tar on both ends over an ssh
pipe, but in order to eliminate possible ssh problems I have also tried tar
over a ttcp connection, which also fails. There is no obvious pattern to
when this happens; I have experienced failures after transferring 1.15GB,
51.4GB and 23.6GB.

Here is the output from netstat -n -o filtered for port 22 and slightly
edited. All the lines started with Proto == tcp and Recv-Q == 0.

Send-Q  Local Addr  Foreign Addr  State        Timer
     0  old_pc:22   new_pc:52667  ESTABLISHED  keepalive (3513.93/0/0)
     0  old_pc:22   new_pc:43825  ESTABLISHED  keepalive (5467.38/0/0)
  2896  old_pc:22   new_pc:58601  ESTABLISHED  on (21020884.65/0/0)
  4344  old_pc:22   new_pc:54105  ESTABLISHED  on (21017016.33/0/0)
  2896  old_pc:22   new_pc:34149  ESTABLISHED  on (20986889.24/0/0)

The first two connections are ongoing, working, interactive ssh
connections. The other three connections died days ago on my new PC.

One thing that caught my eye was the very high timer values.
Checking the netstat source reveals that the value printed is "(double)
time_len / HZ" and that time_len is extracted from /proc/net/tcp. While
my CONFIG_HZ is 1000, I assume netstat has picked up HZ as 100 from
/usr/include/asm/param.h. Multiplying the printed values by 100 gives
raw values of roughly 2.1 * 10^9, which really seems to imply that
there is some integer overflow since 2^31 = 2147483648.
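
The arithmetic behind that suspicion can be checked with a short Python
sketch. It only undoes the division netstat performs, using the three
timer values from the listing above. NETSTAT_HZ = 100 is the assumption
stated in the text, not a verified fact, and to_ticks is a helper name
made up for this illustration.

```python
# Timer values (in seconds) as printed by netstat for the three hung connections.
timers_s = [21020884.65, 21017016.33, 20986889.24]

NETSTAT_HZ = 100  # assumption: the HZ value netstat was built with


def to_ticks(seconds, hz=NETSTAT_HZ):
    """Undo netstat's '(double) time_len / HZ' to recover the raw time_len."""
    return round(seconds * hz)


for s in timers_s:
    ticks = to_ticks(s)
    # Show the raw value, how close it sits to 2^31, and the implied duration.
    print(f"{ticks:>12d}  {ticks / 2**31:.3f} of 2^31  {s / 86400:.1f} days")
```

Under that assumption each raw value lands just below 2^31, which is the
pattern one would expect if a small negative or wrapped difference were
being interpreted as a large positive number.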

Looking into get_tcp4_sock in net/ipv4/tcp_ipv4.c I see that timer_expires
is initialized with icsk->icsk_timeout for the troublesome cases. But
here my competence to trace this further stops, so I have no idea of
how icsk->icsk_timeout gets such high values.
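
For anyone who wants to look at the raw values rather than netstat's
formatted output, here is a minimal Python sketch that pulls the queue and
timer fields out of one /proc/net/tcp line. It assumes the usual column
order (sl, local_address, rem_address, st, tx_queue:rx_queue, tr:tm->when,
...), all in hex. The sample line is invented for illustration and is not
taken from either of my machines, and parse_timer_fields is a hypothetical
helper name.

```python
def parse_timer_fields(line):
    """Extract queue sizes and timer fields from one /proc/net/tcp entry.

    Assumes whitespace-separated columns in the order:
    sl local_address rem_address st tx_queue:rx_queue tr:tm->when ...
    """
    fields = line.split()
    tx_hex, rx_hex = fields[4].split(":")
    tr_hex, when_hex = fields[5].split(":")
    return {
        "tx_queue": int(tx_hex, 16),      # bytes waiting to be sent
        "rx_queue": int(rx_hex, 16),      # bytes waiting to be read
        "timer_active": int(tr_hex, 16),  # which timer is pending (0 = none)
        "when_ticks": int(when_hex, 16),  # raw value netstat divides by HZ
    }


# Hypothetical sample line, for illustration only.
SAMPLE = ("   3: 0A00000A:0016 0A00000B:E4E9 01 00000B50:00000000 "
          "01:000003E8 00000000     0        0 12345")

info = parse_timer_fields(SAMPLE)
print(info)
```

Running this over the real file on the affected machine (skipping the
header line) would show the raw tm->when values that end up as the huge
numbers in the netstat Timer column.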

My old PC is currently still running with these stalled connections
present, so let me know if there is something I should try to investigate
further. I can post output from /proc/net/tcp and my .config if you want
to have a look. My old PC is 32 bit/Celeron single core, kernel 2.6.24,
while my new one is 64 bit/Q9300 quad core, kernel 2.6.25.3. The ethernet
cards are the following:

02:0d.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
        RTL-8139/8139C/8139C+ (rev 10)
02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056
        PCI-E Gigabit Ethernet Controller (rev 12)

BR Håkon Løvdal