star3> netstat -a -t
Active Internet connections (including servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp        0      0 *:sunrpc                *:*                     LISTEN      
tcp        0      0 *:ftp                   *:*                     LISTEN      
tcp        0      0 *:telnet                *:*                     LISTEN      
tcp        0      0 *:gopher                *:*                     LISTEN      
tcp        0      0 *:shell                 *:*                     LISTEN      
tcp        0      0 *:login                 *:*                     LISTEN      
tcp        0      0 *:pop-2                 *:*                     LISTEN      
tcp        0      0 *:pop                   *:*                     LISTEN      
tcp        0      0 *:imap                  *:*                     LISTEN      
tcp        0      0 *:finger                *:*                     LISTEN      
tcp        0      0 *:time                  *:*                     LISTEN      
tcp        0      0 *:auth                  *:*                     LISTEN      
tcp        0      0 *:857                   *:*                     LISTEN      
tcp        0      0 *:smtp                  *:*                     LISTEN      
tcp        0      0 *:10025                 *:*                     LISTEN      
tcp        0      3 star3.messier:login     starzero.messier:1013   ESTABLISHED 
tcp        0      0 star3.messier:shell     star1.messier:1019      ESTABLISHED 
tcp        0      0 star3.messier:1023      star1.messier:1018      ESTABLISHED 
tcp        0      0 star3.messier:1147      star1.messier:1110      ESTABLISHED 
tcp        0      0 *:1148                  *:*                     LISTEN      
tcp        0      0 star3.messier:shell     star1.messier:1008      ESTABLISHED 
tcp        0      0 star3.messier:1022      star1.messier:1005      ESTABLISHED 
tcp        0      0 star3.messier:1149      star1.messier:1110      ESTABLISHED 
tcp        0      0 *:1150                  *:*                     LISTEN      
tcp        0      0 star3.messier:1152      star2.messier:1130      ESTABLISHED 
tcp        0      0 star3.messier:1155      star3.messier:1154      ESTABLISHED 
tcp        0      0 star3.messier:1154      star3.messier:1155      ESTABLISHED 
tcp        0      0 star3.messier:1150      star1.messier:1121      TIME_WAIT   
tcp        0      0 star3.messier:1156      star1.messier:1122      ESTABLISHED 
tcp        0      0 star3.messier:1148      star4.messier:1134      TIME_WAIT   
tcp        0      0 star3.messier:1150      star4.messier:1136      TIME_WAIT   
tcp        0      0 star3.messier:1159      star4.messier:1138      ESTABLISHED 
tcp        0   7796 star3.messier:1160      star4.messier:1139      ESTABLISHED 
tcp        0      0 star3.messier:1148      star1.messier:1125      TIME_WAIT   
tcp        0      0 star3.messier:1162      star1.messier:1127      ESTABLISHED 
Recv-Q on star4 for "star3.messier:1160      star4.messier:1139" is empty, and it
stays that way until the connection times out. Here is the tail of the tcpdump
log for this connection. After several spurious duplicate ACKs from star4, all
traffic from star4 stops and star3 just keeps retransmitting the same segment.
...
13:49:31.298117 star4.messier.1139 > star3.messier.1160: . 1848225:1849673(1448) ack 1851121 win 7240 <nop,nop,timestamp 56303 57114> (DF) [tos 0x18] (ttl 64, id 16102)
13:49:31.298173 star3.messier.1160 > star4.messier.1139: . ack 1849673 win 14480 <nop,nop,timestamp 57115 56303> (DF) [tos 0x18] (ttl 64, id 50303)
13:49:31.298125 star4.messier.1139 > star3.messier.1160: . ack 1854017 win 5792 <nop,nop,timestamp 56303 57114> (DF) [tos 0x18] (ttl 64, id 16103)
13:49:31.298367 star4.messier.1139 > star3.messier.1160: . 1849673:1851121(1448) ack 1855465 win 4344 <nop,nop,timestamp 56303 57115> (DF) [tos 0x18] (ttl 64, id 16107)
13:49:31.299299 star4.messier.1139 > star3.messier.1160: . 1851121:1852569(1448) ack 1855465 win 15928 <nop,nop,timestamp 56303 57115> (DF) [tos 0x18] (ttl 64, id 16112)
13:49:31.299516 star4.messier.1139 > star3.messier.1160: . 1852569:1854017(1448) ack 1855465 win 15928 <nop,nop,timestamp 56303 57115> (DF) [tos 0x18] (ttl 64, id 16113)
13:49:31.299573 star3.messier.1160 > star4.messier.1139: . ack 1854017 win 14480 <nop,nop,timestamp 57115 56303> (DF) [tos 0x18] (ttl 64, id 50314)
13:49:31.299784 star4.messier.1139 > star3.messier.1160: . 1854017:1855465(1448) ack 1855465 win 15928 <nop,nop,timestamp 56303 57115> (DF) [tos 0x18] (ttl 64, id 16114)
13:49:31.300049 star3.messier.1160 > star4.messier.1139: . ack 1856913 win 14480 <nop,nop,timestamp 57115 56303> (DF) [tos 0x18] (ttl 64, id 50316)
13:49:31.300276 star4.messier.1139 > star3.messier.1160: . 1856913:1858361(1448) ack 1855465 win 15928 <nop,nop,timestamp 56303 57115> (DF) [tos 0x18] (ttl 64, id 16117)
13:49:31.300588 star3.messier.1160 > star4.messier.1139: . ack 1859809 win 14480 <nop,nop,timestamp 57115 56303> (DF) [tos 0x18] (ttl 64, id 50318)
13:49:31.301098 star3.messier.1160 > star4.messier.1139: . ack 1862705 win 14480 <nop,nop,timestamp 57115 56303> (DF) [tos 0x18] (ttl 64, id 50320)
13:49:31.301089 star4.messier.1139 > star3.messier.1160: P 1862705:1863193(488) ack 1855465 win 15928 <nop,nop,timestamp 56303 57115> (DF) [tos 0x18] (ttl 64, id 16124)
13:49:31.301839 star4.messier.1139 > star3.messier.1160: . ack 1855465 win 15928 <nop,nop,timestamp 56303 57115,nop,nop,[|tcp]> (DF) [tos 0x18] (ttl 64, id 16130)
13:49:31.301949 star4.messier.1139 > star3.messier.1160: . ack 1855465 win 15928 <nop,nop,timestamp 56303 57115,nop,nop,[|tcp]> (DF) [tos 0x18] (ttl 64, id 16131)
13:49:31.302227 star4.messier.1139 > star3.messier.1160: . ack 1855465 win 15928 <nop,nop,timestamp 56303 57115,nop,nop,[|tcp]> (DF) [tos 0x18] (ttl 64, id 16133)
13:49:31.302517 star4.messier.1139 > star3.messier.1160: . ack 1855465 win 15928 <nop,nop,timestamp 56303 57115,nop,nop,[|tcp]> (DF) [tos 0x18] (ttl 64, id 16134)
13:49:31.302522 star4.messier.1139 > star3.messier.1160: . ack 1855465 win 15928 <nop,nop,timestamp 56303 57115,nop,nop,[|tcp]> (DF) [tos 0x18] (ttl 64, id 16136)
13:49:31.309051 star3.messier.1160 > star4.messier.1139: P 1863193:1863225(32) ack 1863193 win 15928 <nop,nop,timestamp 57116 56303> (DF) [tos 0x18] (ttl 64, id 50382)
13:49:31.309072 star3.messier.1160 > star4.messier.1139: P 1863225:1863261(36) ack 1863193 win 15928 <nop,nop,timestamp 57116 56303> (DF) [tos 0x18] (ttl 64, id 50383)
13:49:31.309248 star4.messier.1139 > star3.messier.1160: . ack 1855465 win 15928 <nop,nop,timestamp 56304 57116,nop,nop,[|tcp]> (DF) [tos 0x18] (ttl 64, id 16149)
13:49:31.309306 star4.messier.1139 > star3.messier.1160: . ack 1855465 win 15928 <nop,nop,timestamp 56304 57116,nop,nop,[|tcp]> (DF) [tos 0x18] (ttl 64, id 16150)
13:49:31.493869 star3.messier.1160 > star4.messier.1139: . 1855465:1856913(1448) ack 1863193 win 15928 <nop,nop,timestamp 57135 56304> (DF) [tos 0x18] (ttl 64, id 50388)
13:49:31.893864 star3.messier.1160 > star4.messier.1139: . 1855465:1856913(1448) ack 1863193 win 15928 <nop,nop,timestamp 57175 56304> (DF) [tos 0x18] (ttl 64, id 50389)
13:49:32.693868 star3.messier.1160 > star4.messier.1139: . 1855465:1856913(1448) ack 1863193 win 15928 <nop,nop,timestamp 57255 56304> (DF) [tos 0x18] (ttl 64, id 50410)
13:49:34.293868 star3.messier.1160 > star4.messier.1139: . 1855465:1856913(1448) ack 1863193 win 15928 <nop,nop,timestamp 57415 56304> (DF) [tos 0x18] (ttl 64, id 50433)
13:49:37.493868 star3.messier.1160 > star4.messier.1139: . 1855465:1856913(1448) ack 1863193 win 15928 <nop,nop,timestamp 57735 56304> (DF) [tos 0x18] (ttl 64, id 50489)
13:49:43.893869 star3.messier.1160 > star4.messier.1139: . 1855465:1856913(1448) ack 1863193 win 15928 <nop,nop,timestamp 58375 56304> (DF) [tos 0x18] (ttl 64, id 50610)
13:49:56.693871 star3.messier.1160 > star4.messier.1139: . 1855465:1856913(1448) ack 1863193 win 15928 <nop,nop,timestamp 59655 56304> (DF) [tos 0x18] (ttl 64, id 50689)
13:50:22.293868 star3.messier.1160 > star4.messier.1139: . 1855465:1856913(1448) ack 1863193 win 15928 <nop,nop,timestamp 62215 56304> (DF) [tos 0x18] (ttl 64, id 50878)
...etc, until timeout
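The spacing of the repeated 1855465:1856913 segment in the log above can be
checked with a small script. This is only a sketch: the timestamps are copied
from the tcpdump lines, while the helper names (to_microseconds,
retransmit_intervals) are made up for illustration. The gaps come out at
roughly 0.4, 0.8, 1.6, 3.2, 6.4, 12.8 and 25.6 seconds, doubling each time,
which looks like ordinary exponential backoff of the retransmit timer on
star3 rather than anything unusual on the sending side.

```python
# Timestamps of star3's retransmissions of segment 1855465:1856913,
# copied from the tcpdump lines above.
RETRANSMIT_TIMES = [
    "13:49:31.493869",
    "13:49:31.893864",
    "13:49:32.693868",
    "13:49:34.293868",
    "13:49:37.493868",
    "13:49:43.893869",
    "13:49:56.693871",
    "13:50:22.293868",
]


def to_microseconds(stamp):
    """Convert an HH:MM:SS.ffffff timestamp to integer microseconds.

    Assumes exactly six fractional digits, as in the log above.
    """
    hms, frac = stamp.split(".")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1_000_000 + int(frac)


def retransmit_intervals(stamps):
    """Return the gaps, in seconds, between consecutive timestamps."""
    micros = [to_microseconds(s) for s in stamps]
    return [(later - earlier) / 1_000_000
            for earlier, later in zip(micros, micros[1:])]


if __name__ == "__main__":
    for gap in retransmit_intervals(RETRANSMIT_TIMES):
        print(f"{gap:.3f} s")
```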
The corresponding /proc/net/tcp records look like this:
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0501A8C0:048A 0301A8C0:0467 01 00000000:00000000 00:00000000 00000000   520        0 3122
   1: 0501A8C0:0488 0601A8C0:0473 01 00001E74:00000000 01:00000CD0 00000008   520        0 3115
   2: 0501A8C0:0487 0601A8C0:0472 01 00000000:00000000 00:00000000 00000000   520        0 3114
   ...
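For readers who do not decode the hex fields by eye, here is a minimal sketch
that turns one /proc/net/tcp record into readable values. It assumes the
addresses were printed by a little-endian (x86) kernel, and the function names
are hypothetical. Applied to record 1 above it gives 192.168.1.5:1160 ->
192.168.1.6:1139 in state 1 (ESTABLISHED) with tx_queue = 7796 and a
retransmit count of 8. The ports and the 7796-byte queue match the stalled
star3.messier:1160 line in the netstat output, so 192.168.1.5 and 192.168.1.6
are presumably star3 and star4.

```python
import socket
import struct

# Record 1 from the /proc/net/tcp dump above (the stalled connection).
STALLED = ("   1: 0501A8C0:0488 0601A8C0:0473 01 00001E74:00000000 "
           "01:00000CD0 00000008   520        0 3115")


def decode_endpoint(field):
    """Decode 'HEXADDR:HEXPORT' into (dotted_quad, port).

    Assumption: the kernel that wrote the record is little-endian, so the
    hex word is the network-order address read as a native 32-bit integer.
    """
    addr_hex, port_hex = field.split(":")
    packed = struct.pack("<I", int(addr_hex, 16))
    return socket.inet_ntoa(packed), int(port_hex, 16)


def decode_record(line):
    """Decode the leading columns of one /proc/net/tcp record."""
    fields = line.split()
    tx_queue, rx_queue = (int(v, 16) for v in fields[4].split(":"))
    timer, when = fields[5].split(":")
    return {
        "local": decode_endpoint(fields[1]),
        "remote": decode_endpoint(fields[2]),
        "state": int(fields[3], 16),      # 1 is TCP_ESTABLISHED
        "tx_queue": tx_queue,             # bytes queued for sending
        "rx_queue": rx_queue,
        "timer": int(timer, 16),          # the "tr" column
        "timer_when": int(when, 16),      # the "tm->when" column
        "retransmits": int(fields[6], 16),
    }


if __name__ == "__main__":
    for key, value in decode_record(STALLED).items():
        print(f"{key:12} {value}")
```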
Also, I have one more report about the same problem with MPI on dual PII 400 systems.
The hardware is slightly different (Gigabyte Ga-6BXDS boards, Tulip 21140 NICs) but
the symptoms are the same.
Any suggestions? I can offer remote access to the cluster if someone familiar with 
the networking code wants to take a closer look at this.
Alex Korobka