[BUG REPORT] TCP/IP weirdness in 2.2.15

From: Stephen Landamore (stephenl@zeus.com)
Date: Thu Jun 08 2000 - 04:28:47 EST


Hi Folks,

I'm seeing some very strange behaviour here with TCP/IP connections
(or at least, it looks strange to me) - under heavy load, it appears
connections are spuriously dropped.

Bear with me while I give a little background...

...I'm actually trying to perform some SPECweb99 benchmarking; I have
a test bed of 5 clients and 1 server. For those of you who are not
familiar with SPECweb99, let's just say that it tries to simulate
'real world' traffic - a mixture of HTTP GET, POST, static and dynamic
GET, keepalive connections, non-keepalive connections... basically
hammer the server as hard as you can until it grinds to a halt. Each
client has one controlling (parent) process which forks off N children
to generate the HTTP load. One client is a designated 'prime client'
which synchronises the rest of the clients. If you're interested in
more SPEC info, there's a very informative whitepaper on their
website, have a look at http://www.spec.org/osg/web99

The "dropped connection" manifests itself by the SPEC run 'hanging' -
at the end of the test there is one or more child processes on one or
more clients stuck in a read(2), trying to read from a
socket. (usually, one child process on one client).

If I do a netstat on the client I see an ESTABLISHED connection:

[stephenl@spec1 stephenl]$ netstat -t
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
[...]
tcp 0 0 spec1.specweb.zeus:6426 spec6.specweb.zeus.:www ESTABLISHED
[...]
[stephenl@spec1 stephenl]$

(spec1 is the client, spec6 is the server. As it happens, spec1 is
also my prime client - this is just a coincidence in this particular
lockup)

If I hop over to the server and do a netstat there, I
see... nothing. No ESTABLISHED connection, no TIME_WAIT, no
CLOSE_WAIT. Nothing.

Question: is this valid? (I think it may be, if the server
crashed. Which it didn't - all machines have enjoyed a nice long
uptime :-)

I've done more digging. I captured tcpdumps of the entire SPEC run, on
all machines. I added lots of extra debugging logs to the client code,
I ran 'netstat -tn' every 10 seconds on all machines for the duration
of the test. Oh yes, and the webserver logs.

Here is what I've found...

Debugging logfile
-----------------
The client creates a socket, binds a port to it and connects to the
webserver. It write()s a HTTP GET request, and read()s a file back. So
far so good. Without closing the socket, another request is sent and
another file is received (on keepalives, there are a random number of
HTTP requests performed - on average, 10). Now here is where it gets
interesting... still without closing the socket, write() another HTTP
request and go to read() the file... the read() sits there and this
child is effectively dead. I've observed it to hang on anywhere from
the 2nd request to the 7th or 8th request.

tcpdump logs
------------
This matches perfectly what I've just detailed above... I see all the
previous requests on this port, I see the server send the previous
file, I see the client ACK this, I see the client send the final,
fatal GET... I see the server ACK this. I see this on both the client
AND server tcpdump logfiles - they both perfectly match. After the
final, fatal ACK I see... nothing. The client never sends any more
packets, the server never sends any more packets. Specifically I see
no FIN from either end.

webserver logs
--------------
I see all requests from the offending client up to and including the
previous request. The offending GET never makes it to the server logs.

netstat logs
------------
Running a 'netstat -tn' every 10 seconds on the client, I see the
connection spring into existence somewhere through the test (sometimes
near the beginning of the test, sometimes near the end. On average
about halfway through, on a 10-minute run). Once the connection has
been made, it is reported as 'ESTABLISHED' until I terminate the stuck
process... Running a 'netstat -tn' every 10 seconds on the server, I
see the connection spring into existence at the same time; I see the
same connection 10 seconds later; 20 seconds later it has vanished. No
CLOSE, no CLOSE_WAIT. Not, even, a TIME_WAIT. According to netstat,
the connection on the server simply vanished. I'm *sure* this is just
plain wrong - but it's what I see.

misc logs
---------
No error messages in the webserver log; nothing in 'dmesg', nothing in
/var/spool/*

---

I've searched the archives best as I can but couldn't find anything relevant (actually I've been loosely following linux-kernel for the past 2 years, and I would expect to remember if I'd seen this before, but I'm sure better minds can advise!)

Please help me folks, I've spent hours tracking it down this far, and it's starting to wear me down :-/

I'd really appreciate info on how to go about debugging from here... or a pointer as to what I'm doing wrong (am I doing something wrong?)

---

I'm deliberately not going to include .config files etc here - this mail is getting long enough as it is. Of course all information is available - just ask. But I will give a concise system description:

All clients are identical, Celeron 500's with IDE subsystem and twin EEPRO10/100 NICs (actually I'm only using 1 NIC for my tests at present).

The server is basically the same as the client, except it has a PIII 700.

All systems were originally installed with RedHat 6.0, the only thing that has changed is now ALL machines have linux kernel 2.2.15 (and I hoped that would fix my problems :-)

Webserver software is Zeus 3.3.6.

The server has had /proc/sys/fs/{file,inode}-max raised to cope with the big fileset (and the PAM limits raised)

That's all I can think of for now, so I'll stop typing here. Any feedback would be deeply appreciated,

Best regards, Stephen

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Thu Jun 15 2000 - 21:00:15 EST