Sporadic ESP payload corruption when using IPSec in NAT-T Transport Mode

From: Evan Gilman
Date: Thu Jun 26 2014 - 17:14:10 EST


Hi all

We have a couple of Ubuntu 10.04 hosts running kernel version 3.14.5
which are experiencing TCP payload corruption when using IPSec in NAT-T
transport mode. All are running under Xen at third party providers.
When communicating with other hosts using IPSec, we see that these
corrupt TCP PDUs are still being received by the remote listener, even
though the TCP checksum is invalid.

All other checksums (IPSec authentication header and IP checksum) are
good. So, we are thinking that corruption is happening during the ESP
encapsulation and decapsulation phase (IPSec required for
reproduction). The corruption occurs sporadically, and we have not
found any one payload/packet combination that will reliably trigger
it, though we can typically reproduce it in less than 30 minutes. We
can do it very simply by reading from /dev/zero with dd and piping
through netcat. It occurs whenever a 3.14.5 kernel is involved at
either end of the conversation. I can send captures to those who are
interested. Does any of this sound familiar?
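
For anyone who wants to try the dd/netcat test, here is a rough sketch of
a receiver-side check. Since the stream is all zeros, any non-zero byte
that arrives is corruption. The function name find_nonzero and the chunk
size are illustrative assumptions, not part of our actual setup.

```python
import io


def find_nonzero(stream: io.BufferedIOBase, chunk_size: int = 65536):
    """Return (offset, value) for every non-zero byte read from stream.

    The stream is expected to carry only zero bytes (data read from
    /dev/zero on the sender), so every hit marks a corrupted byte.
    """
    hits = []
    offset = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of stream
        # Fast path: skip the per-byte scan when the chunk is all zeros.
        if chunk.count(0) != len(chunk):
            for i, value in enumerate(chunk):
                if value:
                    hits.append((offset + i, value))
        offset += len(chunk)
    return hits
```

One way to use it is to pipe the output of the listening netcat into a
small script that calls find_nonzero(sys.stdin.buffer) and prints the
offsets, which can then be matched against the tcpdump captures.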

Steps and observations so far:
- tcpdump running on both sender and receiver
- ESP looks sane on the outside. TCP payload corruption can be seen
only after decryption
- Once reproduced, you may see only one or two problem packets come through
- Sometimes corruption is witnessed on the wire (suspected
encapsulation corruption)
- Sometimes corruption is _not_ witnessed on the wire, though the test
surfaces corruption (suspected decapsulation corruption)
- Corruption not witnessed over connections without a governing IPSec policy
- Corruption not witnessed after changing previously misbehaving hosts
to kernel version 2.6.32.

You can find the kernel config for the affected host here:
https://gist.github.com/evan2645/2c28d46e81d2b4c8f251

On another note, it seems the assumption that TCP payloads are safe
when encapsulated by ESP, and therefore the checksum need not be
verified, is a false one. It has certainly caused us a great deal of
pain. Is there a significant reason for bypassing TCP checksum
validation when using IPSec Transport Mode?
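
To make the checksum point concrete, below is a minimal sketch of how a
TCP checksum can be verified offline on a decrypted segment pulled from a
capture. It uses the standard Internet checksum over the IPv4
pseudo-header plus the TCP segment. The function names are hypothetical,
and it assumes you pass in the addresses the sender used when it computed
the checksum, which behind NAT may differ from what the receiver sees.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum of data, padded to an even length."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def tcp_checksum_ok(src_ip: str, dst_ip: str, segment: bytes) -> bool:
    """Check the checksum of a raw TCP segment (header plus payload).

    src_ip and dst_ip are the IPv4 addresses for the pseudo-header.
    A valid segment, summed with its checksum field included, gives 0xFFFF.
    """
    pseudo_header = (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!BBH", 0, 6, len(segment))  # zero, protocol 6 (TCP), length
    )
    return ones_complement_sum(pseudo_header + segment) == 0xFFFF
```

Running this over the decrypted segments from both the sender-side and
receiver-side captures would show on which side a bad segment first
appears.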

We are still trying to locate the exact spot where the corruption
is occurring. Any suggestions on how we could do that? We have not
seen this problem under Ubuntu 10.04 with kernel version 2.6.32.
Thanks in advance!
--
evan