Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

From: Ilpo Järvinen
Date: Sat May 31 2008 - 12:10:59 EST


Thanks for reporting!

On Sat, 31 May 2008, Håkon Løvdal wrote:

> posted a few days ago are somewhat different from mine, however I believe
> this is the same problem or at least related. Just as Ingo experienced,
> netstat -p only shows PID/program as '-' for the hung connections while
> for other connections it shows the expected results.

Hmm, are the other end's processes still there? ...I'd be interested to
know what they're doing at the moment...

> I have recently bought a new PC and have started the process of copying
> stuff from my old PC to the new PC. During this I have experienced this
> hang several times. I started copying by using tar on both ends over a ssh
> pipe but in order to eliminate possible ssh problems I also have tried tar
> over a ttcp connection which also fails. There is no obvious pattern of
> when this happens, I have experienced failures after transferring
> 1.15GB, 51.4GB and 23.6GB.
>
> Here is the output from netstat -n -o filtered for port 22 and slightly
> edited. All the lines started with Proto == tcp and Recv-Q == 0.

...The receiving end's state would be more interesting.

> Send-Q Local Addr Foreign Addr State Timer
> 0 old_pc:22 new_pc:52667 ESTABLISHED keepalive (3513.93/0/0)
> 0 old_pc:22 new_pc:43825 ESTABLISHED keepalive (5467.38/0/0)
> 2896 old_pc:22 new_pc:58601 ESTABLISHED on (21020884.65/0/0)
> 4344 old_pc:22 new_pc:54105 ESTABLISHED on (21017016.33/0/0)
> 2896 old_pc:22 new_pc:34149 ESTABLISHED on (20986889.24/0/0)
>
> The first two connections are ongoing, working, interactive ssh
> connections. The other three connections died days ago on my new PC.

Died? Do you mean that they don't exist at all at the other end anymore?

> One thing that caught my eye was these very high timer values.
> Checking the netstat source reveals that the value printed is "(double)
> time_len / HZ" and that time_len is extracted from /proc/net/tcp. While
> my CONFIG_HZ is 1000, I assume netstat has picked up HZ as 100 from
> /usr/include/asm/param.h, and then things really seem to imply that
> there is some integer overflow since 2^31 = 2147483648.
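
The arithmetic behind that suspicion can be sketched as follows. This is only
an illustration: it assumes, as the quoted text does, that netstat divided the
raw value by HZ = 100, and the helper name raw_ticks is hypothetical.
Multiplying the three printed values back by 100 gives raw counts that sit
roughly 2 to 3 percent below 2^31.

```python
# Assumption (from the quoted text): netstat printed raw_ticks / 100.
NETSTAT_HZ = 100

def raw_ticks(printed_seconds: str) -> int:
    """Recover the raw tick count from a netstat timer value (hypothetical helper).

    Works on the decimal string directly to avoid floating-point rounding.
    """
    whole, _, frac = printed_seconds.partition(".")
    return int(whole) * NETSTAT_HZ + int(frac.ljust(2, "0")[:2])

# Timer values of the three stalled connections, as printed by netstat.
printed = ["21020884.65", "21017016.33", "20986889.24"]

for value in printed:
    ticks = raw_ticks(value)
    # Show the raw count and how far it is below 2^31.
    print(value, ticks, 2**31 - ticks)
```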

...plain /proc/net/tcp would be much nicer to read and without all such
conversion troubles ;-).
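
For reading the raw file, a minimal sketch of decoding one /proc/net/tcp line
is shown here. The sample line is made up for illustration and is not output
from either machine, the function names are hypothetical, and the address
decoding assumes a little-endian host such as x86. The timer value is returned
exactly as the kernel printed it, with no division by any HZ.

```python
import socket
import struct

def parse_addr(field):
    """Decode 'HEXIP:HEXPORT' (assumes a little-endian host, e.g. x86)."""
    ip_hex, port_hex = field.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)

def parse_tcp_line(line):
    """Pick the queue and timer fields out of one /proc/net/tcp entry."""
    f = line.split()
    tx_queue, rx_queue = f[4].split(":")
    timer_active, timer_when = f[5].split(":")
    return {
        "local": parse_addr(f[1]),
        "remote": parse_addr(f[2]),
        "state": int(f[3], 16),            # 1 is ESTABLISHED
        "tx_queue": int(tx_queue, 16),     # shown by netstat as Send-Q
        "rx_queue": int(rx_queue, 16),     # shown by netstat as Recv-Q
        "timer_active": int(timer_active, 16),
        "timer_when": int(timer_when, 16), # raw value, no HZ conversion
    }

# Made-up sample line, for illustration only.
sample = ("   0: 0100007F:0016 0100007F:CDBB 01 00000B50:00000000 "
          "01:00000064 00000000     0        0 12345")
entry = parse_tcp_line(sample)
print(entry)
```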

> Looking into get_tcp4_sock in net/ipv4/tcp_ipv4.c I see that timer_expires
> is initialized with icsk->icsk_timeout for the troublesome cases. But
> here my competence to trace this further stops, so I have no idea of
> how icsk->icsk_timeout gets such high values.
>
> My old PC is currently still running with these stalled connections
> present so let me know if there is something I should try to investigate
> further.
>
> I can post output from /proc/net/tcp

For both ends that would be great.

> and my .config if you want to have a look.

Not needed I think.

> My old PC is 32 bit/Celeron single core, kernel 2.6.24,
> while my new is 64 bit/Q9300 quad core, kernel 2.6.25.3.
> The ethernet cards are the following:
>
> 02:0d.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
> RTL-8139/8139C/8139C+ (rev 10)
> 02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056
> PCI-E Gigabit Ethernet Controller (rev 12)


--
i.