TCPv4 bad checksums

Jack (jschmid@udallas.edu)
Mon, 21 Dec 1998 10:50:21 -0600 (CST)


This is a lengthy me too post, to try and emphasize that
there does very probably exist a linux code error in the
serial/ppp/tcp code. The evidence for this is the seeming
independence on all external (to linux) factors. Since
this problem has been around since mid-October, I and my
friends have had a great deal of time to try and fix the
"hardware" problem.

Our specific symptom is the plethora of TCPv4 bad checksum
messages, sometimes resulting in stalled connections
[1]. Approximately 2-5% of all ip-packets over ppp0 are
TCPv4 bad checksums. The numbers are roughly 150k packets,
3k bad checksums per day. [2]

Assuming of course that hardware was at fault we reseated
connections, then replaced cables, then replaced modems,
then replaced motherboards. Of course we tried different
service providers (2 commercial and 1 university). We even
tried different cities and different phone companies. All
to no avail.

We then of course ran the other key part of a scientific
inquiry, the control. My reliable[3] MkLinux box downloaded
over a 28.8 at an impressive 3.0kBps and 3.4kBps with no
apparent stalls as per tcpdump.[4]

This leads me to conclude that the problem is either an
evil genius toying with my and my friends tcp checksums,
or is (however unlikely it may be) an error in the serial,
ppp, and tcp code interaction.

Details follow:

3 motherboard's affected (Soho, Asus, and some other one)

3 modems (usr sportster 33.6, usr internal 28.8, global
village 28.8)

5-10 phone cords

4 serial cables ( brand new belkin, 2 recent belkins, and a
noname suspect cable)

4 phone jacks at 2 seperate locations.

3 ISPs (800-be-a-geek,large local commercial ISP, and a
small-ish university)

2 phone companies: GTE, SWBell

Linux 2.1.127-2.1.131

Not affected:

MkLinux, same phone cord, same phone jack, both locations.

Local ethernet

PPP over local ethernet (using cotty[5] and telnet -E -8 -L)

This suggests that neither hardware, nor cabling, nor
ISP's, nor nearby routers, nor phone issues are the
culprit. The lack of a problem over ethernet suggests
that the tcp code is fine. The lack of a problem over
ethernet'd ppp suggests that the ppp code is fine. This
leaves the serial code to take the blame, though likely
in a complicated interaction.

I would reocmmend checknig boundary conditions on the
checksum code, it seems impossible that 327 packets in
a row could all be similarly flawed. Also perhaps ppp
updates the checksum upon receipt (I am unclear if it is
supposed to do this). Perhaps it does not properly update
the packet as it reaches the kernel, or perhaps it does not
update the checksum. Again the ppp over ethernet suggests
the conditions for this happening are odd. Can the serial
port interupt the tcp or ppp code? Perhaps the checksum
is not being performed atomically?

Also perhaps ip-options are being mishandled? I have not
seen any ip-options from any packet actually come to my
machine, but the tcp checksum is a bit bizarre and rather
loosely documented IMHO.

I am of course just learning the kernel, and thus my
opinions on the problem relate entirely to my own failure
to implement a working ip stack.

Possibly of interst: I have 8million serial interupts and
24hrs uptime. I have no idea if that is normal.

If I can provide any more information let me know, the
probably is easily repeatable, nay unavoidable.

-Jack

[1] When one sees that a host resends a packet 327
times, each time with a bad checksum, one wonders if
"noise" can account for this rather systematic error. It
seems odd that noise can disrupt 327 packets in a row
whilst leaving the sequence numbers and ack sequence
numbers completely intact.

[2] 3k checksums resulting in about 0.1 kernel
messages each, makes a sizable chunk of syslog. Luckily
they compress well.

[3] Reliably flakey. MkLinux is currently maintained
by a single foreign language major, who btw rocks the
house. I believe this is his first operating system to
single handedly support, maintain, enhance, and port to
new hardware.

[4] While the malformed packets do not appear to
be available to me or libpcap, the repeated nack's
are. Several hundred packets in one direction on a tcp
connection is a fairly accurate fingerprint of the problem.

[5] From the new firewall piercing mini-howto. The
howto has been quite updated, removing all information
on how to setup ppp over a telnet connection. Instead
one may now download his "solution" which is of course
self-documenting. The "self-documentation" consists
almost solely in configuration instructions rather than
in explaining how it works. The similarity to script kiddy
sploits was uncanny.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/