TCP bug? hanging outgoing connections in 2.2.14

From: Jeremy Fitzhardinge (jeremy@goop.org)
Date: Wed Feb 16 2000 - 16:33:49 EST


Hi,

I'm seeing a very strange thing:

- postfix, my MTA, is trying to deliver some mail via smtp. It connects
  fine (according to tcpdump), but the select syscall just times out. If I
  leave it like this, it logs a "read timeout" failure after 5 mins (the
  select timeout).
- if I manually telnet to the same host and port while the delivery is
  hanging in select, the smtp delivery's select completes, but the following
  read returns EOF. The mail delivery is logged as failing with "server
  dropped connection".
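
For anyone wanting to poke at this outside postfix, here is a minimal
sketch of the same wait-for-greeting step. It is not postfix's code: the
function name and return values are made up for illustration, and it only
watches for readability, whereas the select call in the strace also watches
for exceptional conditions.

```python
import select
import socket


def wait_for_greeting(sock, timeout):
    """Wait up to `timeout` seconds for the server's first bytes.

    Returns "timeout" if select() expires with nothing to read,
    "eof" if the peer closed without sending anything,
    otherwise the bytes that were received.
    """
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return "timeout"   # the "read timeout" case
    data = sock.recv(4096)
    if not data:
        return "eof"       # the "server dropped connection" case
    return data
```

Against the host in this report, the call would be something like
wait_for_greeting(socket.create_connection(("csla.csl.sri.com", 25)), 300),
where 300 is the select timeout seen in the strace. A healthy SMTP server
should make it return a "220 ..." greeting almost immediately.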

It only happens with *one* site: csla.CSL.sri.com. Postfix has been running
here for about 8 months, and this is the only problem host I've seen (though
come to think of it, there's one other host which may have similar symptoms).

I can't tell whether it's a kernel bug, a postfix bug or a problem with the
remote host. Given that the remote host is the home of comp-risks, I'd be
surprised if it were strange in any way. And postfix doesn't look like it's
doing anything strange at all.

The kernel is 2.2.14, running on my gateway machine (2 interfaces); the host is
also running masquerading and ipchains firewall stuff. The strangest thing
running on this system is FreeSwan; I just reconfigured it out, and the problem
remains. I also saw this in 2.2.12, but I updated to see if 2.2.14 helped; it
didn't. (And it killed my ~120 day uptime.)

Strace on 'smtp', the smtp delivery daemon, shows:

socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 9
connect(9, {sin_family=AF_INET, sin_port=htons(25), sin_addr=inet_addr("192.12.33.2")}, 16
select(10, [9], NULL, [9], {300, 0} <hang...>

and the corresponding tcpdump shows:

12:49:20.562071 gw.goop.org.1031 > csla.csl.sri.com.smtp: S 1693701511:1693701511(0) win 32120 <mss 1460,sackOK,timestamp 228543[|tcp]> (DF) (ttl 64, id 4107)
12:49:20.585798 csla.csl.sri.com.smtp > gw.goop.org.1031: S 570048000:570048000(0) ack 1693701512 win 4096 (ttl 50, id 21364)
12:49:20.585941 gw.goop.org.1031 > csla.csl.sri.com.smtp: . ack 1 win 32120 (DF) (ttl 64, id 4111)
<pause...>

Everything is quiet. Then I type:

$ telnet csla.csl.sri.com smtp
Trying 192.12.33.2...
Connected to csla.csl.sri.com.
Escape character is '^]'.

And meanwhile, the strace on 'smtp' shows:
) = 1 (in [9], left {274, 470000}) <select finishes>
read(9, "", 4096) = 0
close(9) = 0
time([950734186]) = 950734186
getpid() = 787
sigaction(SIGPIPE, {0x400a2430, [], 0}, {SIG_IGN}) = 0
send(5, "<22>Feb 16 12:49:46 postfix/smtp[787]: connect to csla.CSL.sri.com[192.12.33.2]: server dropped connection (port 25)\0", 117, 0) = 117
...

and tcpdump has (this is both the telnet and the smtp delivery together):

12:49:45.617582 gw.goop.org.1032 > csla.csl.sri.com.smtp: S 1734135925:1734135925(0) win 32120 <mss 1460,sackOK,timestamp 231049[|tcp]> (DF) [tos 0x10] (ttl 64, id 4187)
12:49:45.646624 csla.csl.sri.com.smtp > gw.goop.org.1032: S 573312000:573312000(0) ack 1734135926 win 4096 (ttl 50, id 21597)
12:49:45.646750 gw.goop.org.1032 > csla.csl.sri.com.smtp: . ack 1 win 32120 (DF) [tos 0x10] (ttl 64, id 4189)
12:49:46.116687 gw.goop.org.1031 > csla.csl.sri.com.smtp: . ack 2 win 32696 (DF) (ttl 64, id 4192)
12:49:46.118020 gw.goop.org.1031 > csla.csl.sri.com.smtp: F 1:1(0) ack 2 win 32696 (DF) (ttl 64, id 4193)
12:49:46.149577 csla.csl.sri.com.smtp > gw.goop.org.1031: . ack 2 win 4096 (ttl 50, id 21685)

Then I type "quit" into telnet:

12:49:52.097921 gw.goop.org.1032 > csla.csl.sri.com.smtp: P 1:7(6) ack 1 win 32696 (DF) [tos 0x10] (ttl 64, id 4233)
12:49:52.118327 csla.csl.sri.com.smtp > gw.goop.org.1032: P 1:42(41) ack 7 win 4096 (ttl 50, id 21720)
12:49:52.118467 gw.goop.org.1032 > csla.csl.sri.com.smtp: . ack 42 win 32696 (DF) [tos 0x10] (ttl 64, id 4236)
12:49:52.119047 csla.csl.sri.com.smtp > gw.goop.org.1032: F 42:42(0) ack 7 win 4096 (ttl 50, id 21721)
12:49:52.119148 gw.goop.org.1032 > csla.csl.sri.com.smtp: . ack 43 win 32695 (DF) [tos 0x10] (ttl 64, id 4237)
221 csla.csl.sri.com closing connection
12:49:52.120425 gw.goop.org.1032 > csla.csl.sri.com.smtp: F 7:7(0) ack 43 win 32696 (DF) [tos 0x10] (ttl 64, id 4238)
Connection closed by foreign host.
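
Since the capture above mixes two connections, a small script along these
lines could pull them apart by local port. This is only a sketch: the regex
is written against the tcpdump lines as they appear in this message (one
packet per line), and the function name and default host are my own choices
for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as:
# "12:49:46.118020 gw.goop.org.1031 > csla.csl.sri.com.smtp: F 1:1(0) ack 2 ..."
LINE = re.compile(
    r"^(?P<time>\d\d:\d\d:\d\d\.\d+) "
    r"(?P<src>\S+) > (?P<dst>\S+): "
    r"(?P<flags>[SFPR.]+) "
)


def split_by_local_port(lines, local_host="gw.goop.org"):
    """Group packets by the local port of the connection they belong to.

    Returns {port: [(time, "out" or "in", flags), ...]}.
    """
    prefix = local_host + "."
    conns = defaultdict(list)
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # telnet output or anything else that is not a packet
        src, dst = m.group("src"), m.group("dst")
        if src.startswith(prefix):
            local, direction = src, "out"
        elif dst.startswith(prefix):
            local, direction = dst, "in"
        else:
            continue  # packet does not involve the local host
        port = local.rsplit(".", 1)[1]
        conns[port].append((m.group("time"), direction, m.group("flags")))
    return dict(conns)
```

In the traces here, port 1031 is the smtp delivery and port 1032 is the
manual telnet, so grouping this way makes it easier to see which side sent
what on each connection.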

Any ideas?

Thanks,
        J
