SYN trouble, hardware or software?

Chris Black (cblack@cmpteam4.unil.ch)
Wed, 22 Jul 1998 18:08:19 +0200


I am working on a distributed system which currently runs on a
separate, private network with its own hub. It doesn't use PVM or MPI;
it just uses temporary TCP/IP sockets between the master and the slaves.
It is a typical beowulf configuration where the master has two ethernet
cards, one which connects it to the rest of the net and another which
connects it to the separate cluster network.
We have recently started having network problems which cause the nodes
to be unable to contact the master. This has become our worst vice,
although there is a vice versa (har!): the master also fails to create
sockets to the nodes.
This happens occasionally when jobs are running. I have added socket
retry code to our app which sometimes recovers, but sometimes it does
not, and then the master process running on the master never gets the
message the node is trying to send.
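
The retry code itself is not shown in this message. As a rough
illustration only, and in Python rather than the Perl the application
uses, retrying a connect with a growing delay between attempts might
look like the sketch below. The function name, the default of five
attempts, and the delay values are all hypothetical.

```python
import socket
import time


def connect_with_retry(host, port, attempts=5, base_delay=0.5, timeout=10.0,
                       connect=socket.create_connection, sleep=time.sleep):
    """Try to open a TCP connection, backing off between failed attempts.

    Returns the connected socket, or re-raises the last OSError once all
    attempts are used up. `connect` and `sleep` can be swapped out in tests.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return connect((host, port), timeout)
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the failure
            # Exponential backoff: base_delay, 2x, 4x, ... between attempts.
            sleep(base_delay * (2 ** attempt))
```

Re-raising after the last attempt, rather than returning quietly, keeps
a lost message from going unnoticed on the sending side.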
I am trying to figure out if this is a hardware or software problem. I
believe it is a hardware problem, but before we change out the network
cards and such, I would like to have a good reason for why I think it
is the hardware.
Kernel messages indicate that socket creation is failing due to an
uncompleted handshake (I think). We get messages on the master suck as:
Jul 21 20:47:58 isrec-insect kernel: Warning: possible SYN flood from 192.168.1.12 on 192.168.1.1:20817. Sending cookies.
Jul 21 20:53:13 isrec-insect kernel: Warning: possible SYN flood from 192.168.1.7 on 192.168.1.1:20860. Sending cookies.
Jul 21 20:56:15 isrec-insect kernel: Warning: possible SYN flood from 192.168.1.11 on 192.168.1.1:20885. Sending cookies.
Jul 21 20:57:40 isrec-insect kernel: Warning: possible SYN flood from 192.168.1.7 on 192.168.1.1:20897. Sending cookies.
Jul 21 21:01:30 isrec-insect kernel: Warning: possible SYN flood from 192.168.1.7 on 192.168.1.1:20931. Sending cookies.
Jul 21 21:04:20 isrec-insect kernel: Warning: possible SYN flood from 192.168.1.9 on 192.168.1.1:20954. Sending cookies.
Jul 22 14:11:13 isrec-insect kernel: Warning: possible SYN flood from 192.168.1.4 on 192.168.1.1:24225. Sending cookies.
Jul 22 15:23:00 isrec-insect kernel: Warning: possible SYN flood from 192.168.1.9 on 192.168.1.1:9393. Sending cookies.

(er, that "suck" should be "such", but "suck" also applies)

And on the slaves we get similar messages:
Jul 19 22:49:26 insect-01 kernel: Warning: possible SYN flood from 127.0.0.1 on 127.0.0.1:9027. Sending cookies.
Jul 19 23:11:32 insect-01 kernel: Warning: possible SYN flood from 127.0.0.1 on 127.0.0.1:9027. Sending cookies.
Jul 20 15:23:54 insect-01 kernel: Warning: possible SYN flood from 192.168.1.1 on 192.168.1.2:9027. Sending cookies.
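
To see which nodes and ports are involved most often, warnings like the
ones quoted above can be tallied from the syslog text. This is only a
sketch in Python: the function name is made up, and the pattern assumes
the message wording stays exactly as it appears in these logs.

```python
import re
from collections import Counter

# Pattern based on the warning text quoted in this message. \s+ also
# matches a newline, so warnings wrapped across two lines still match.
SYN_FLOOD = re.compile(
    r"possible SYN flood from\s+(?P<src>\d+\.\d+\.\d+\.\d+)\s+on\s+"
    r"(?P<dst>\d+\.\d+\.\d+\.\d+):(?P<port>\d+)"
)


def tally_syn_warnings(log_text):
    """Count SYN flood warnings per source address and per local port."""
    by_source = Counter()
    by_port = Counter()
    for match in SYN_FLOOD.finditer(log_text):
        by_source[match.group("src")] += 1
        by_port[int(match.group("port"))] += 1
    return by_source, by_port


# Three of the master's log entries from above, wrapped as in the mail.
sample = """\
Jul 21 20:47:58 isrec-insect kernel: Warning: possible SYN flood from
192.168.1.12 on 192.168.1.1:20817. Sending cookies.
Jul 21 20:53:13 isrec-insect kernel: Warning: possible SYN flood from
192.168.1.7 on 192.168.1.1:20860. Sending cookies.
Jul 21 20:57:40 isrec-insect kernel: Warning: possible SYN flood from
192.168.1.7 on 192.168.1.1:20897. Sending cookies.
"""

by_source, by_port = tally_syn_warnings(sample)
for source, count in by_source.most_common():
    print(source, count)
```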

The SYN messages on the master are understandably much more frequent
than the ones on the slaves, but I find it odd that even a socket to
localhost can generate incomplete SYN-initiated handshakes on the
slaves. The processes that run on the slaves do use sockets to
localhost to communicate with other processes on the same host. This
could be changed to IPC shm or temporary FIFOs, but sockets to
localhost are used for now.
No communication is done between the slaves; it is all master<->slave.

My understanding of TCP/IP and the SYN flood attacks is that to create a
socket, the client sends a "SYN" to the server. The server then sets up
the socket structures and such, and sends back an ACK to the client. The
client then completes the connection somehow (this may not be quite
right, but I think I have the general idea of SYN).
So what seems to be happening (and as was forced in SYN flood attacks)
is that the client sends a SYN, then the server tries to tell the client
that the socket is ready, but the client never responds to the ACK to
complete the connection.
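
For reference, the standard TCP handshake has three steps: the client
sends SYN, the server answers with SYN-ACK, and the client finishes with
ACK. The toy model below is only a sketch of the bookkeeping, not the
kernel's code. It assumes a listening socket keeps a bounded queue of
connections that are still waiting to complete or to be accepted, and
that a warning like the ones above is logged when a SYN arrives while
that queue is full. The class name, method names, and capacity are all
made up for illustration.

```python
from collections import OrderedDict


class ListenQueue:
    """Toy model of a listening socket's queue of pending connections."""

    def __init__(self, capacity=5):
        self.capacity = capacity      # hypothetical backlog size
        self.pending = OrderedDict()  # (client_ip, client_port) -> state
        self.warnings = []

    def on_syn(self, client):
        """Client sent SYN; a real server would reply with SYN-ACK."""
        if len(self.pending) >= self.capacity:
            # Queue full: in this model, this is where the warning is logged.
            self.warnings.append(f"possible SYN flood from {client[0]}")
            return False
        self.pending[client] = "SYN_RECEIVED"
        return True

    def on_ack(self, client):
        """Client sent the final ACK; the handshake is complete."""
        if self.pending.get(client) == "SYN_RECEIVED":
            self.pending[client] = "ESTABLISHED"
            return True
        return False

    def accept(self):
        """Application takes the oldest established connection, if any."""
        for client, state in list(self.pending.items()):
            if state == "ESTABLISHED":
                del self.pending[client]
                return client
        return None
```

In this model the queue fills up either when clients never send the
final ACK or when the server application is slow to call accept, so the
warning by itself would not tell the two cases apart.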

So, does anyone have any information or advice? Is this surely a
hardware problem with the (cheap) network cards, or could it be
software?
I realize nobody likes troubleshooting/debugging, but any advice/info
would be greatly appreciated.

More info follows:
How a job is started: The master process running on the master sends out
a message to each node giving it a command line to execute the
computational part of the job (much like rsh, but it is a daemon I
wrote for this purpose).
The slave executes the computational program, reading the data from an
NFS file and dumping output to a local /tmp file. When it is finished,
it notifies the master on another socket and passes the results back.
The master then collects the results from all the slaves and unifies
them.
Note that the traffic is actually pretty minimal and only occurs at the
beginning and end of the job. Jobs take from 20 seconds to 20+ hours,
and sometimes many small jobs are run sequentially from a script.
The network failures seem to occur only with the actual temporary
sockets used by our system, and not NFS, but I don't really know for
sure or know the best way to find out.
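
One possible way to gather evidence, offered as a sketch rather than a
tested diagnostic, is to time a series of plain TCP connects from a node
to a listening port on the master, independent of the application. The
assumption is that a connect which succeeds only after a delay of a
second or more probably needed a retransmitted SYN, which would point at
lost packets. The function names and the 1.0 second threshold are
hypothetical.

```python
import socket
import time


def probe_connects(host, port, count=20, timeout=10.0, pause=0.0):
    """Open and close `count` TCP connections, timing each connect.

    Returns a list of (elapsed_seconds, error) tuples; error is None
    when the connect succeeded.
    """
    results = []
    for _ in range(count):
        start = time.monotonic()
        error = None
        try:
            with socket.create_connection((host, port), timeout):
                pass  # connected; close again straight away
        except OSError as exc:
            error = exc
        results.append((time.monotonic() - start, error))
        time.sleep(pause)
    return results


def summarize(results, slow_threshold=1.0):
    """Count failed connects and successful but slow connects."""
    failures = sum(1 for _, err in results if err is not None)
    slow = sum(1 for elapsed, err in results
               if err is None and elapsed >= slow_threshold)
    return {"total": len(results), "failures": failures, "slow": slow}
```

Running the same probe over the cluster interface and over localhost
would give two sets of numbers to compare.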

Hardware:
master:
Cyrix 6x86L P200+ with 128MB RAM and an EIDE disk. eth0 speaks to the
external network and is a PCI tulip-based card. The card that speaks to
the internal (cluster) network is a 16-bit ISA Addtron ne2k clone.
nodes:
15 nodes, each with a Cyrix 6x86L P200+, 32MB RAM, an EIDE disk, and an
Addtron 16-bit ISA ne2k clone.
network: 10base-T network with a 16-port hub (not a switch).

Software:
Red Hat Linux 4.2, kernel 2.0.35 with "Jumbo-4" cyrix/cpuid patch.
The network part of our software is written in Perl and uses the Perl
IO::Socket library.
The computational part of our software is mostly protein profile
search/alignment and genetic sequence search software natively compiled
from Fortran.
(Being pattern match-type stuff, we don't need strong FPUs.)
The biological databases are split up into chunks and distributed to the
slaves periodically, so the searches read their part of the databases
from the local disk. This also means we don't have to distribute all
this data over the network for each job.

Thanks,
Chris Black
