Re: After many hours all outbound connections get stuck in SYN_SENT

From: James Nichols
Date: Thu Dec 20 2007 - 11:38:00 EST


> But I'd be very surprised if the router is acting as anything more
> that a network-layer device. It might perhaps have some soft connection
> state being used for generating accounting records. Being Cisco
> it's probably a switch-router, so it might carry some per-port hard
> state for validating source IP addresses and ARPs on each port.
>
> The firewall is much more likely to be carrying per-flow Sack
> state. The Cisco PIX had a bug with SACK handling (CSCse14419,
> fixed in 7.0(7), 7.1(2.34), 7.2(2.2), 8.0(0.141) but perhaps it
> has regressed). A simple trace either side of the firewall will
> show the inconsistency between the TCP sequence number (which
> gets randomised) and the Sack sequence number (which didn't).
> You could disable the TCP Sequence Number Randomisation feature
> and see if the fault reoccurs.

I do have TCP Sequence # Randomization enabled on my router. However,
if this was causing an issue, wouldn't it always occur and cause
connection issues, not just after 38 hours of correct operation? I
can look into turning this off, but I'll likely have to jump through
several hoops which will be challenging if I don't have a very clear
definitive reason why this is causing this issue. Plus, I've had this
problem with at least 2 other sets of network switches over the past 4
years. I'm actually running 7.0(6), which doesn't have the fix you
mentioned. If it really is possible that this issue wouldn't always
cause problems, but only after hours of succesful operation, then I
could probably motivate the upgrade. I can try to setup a trace, but
this is a lot of work for other people in my organization, so it will
take quite some time.


> You'd probably should also investigate the Linux kernel,
> especially the size and locks of the components of the Sack data
> structures and what happens to those data structures after Sack is
> disabled (presumably the Sack data structure is in some unhappy
> circumstance, and disabling Sack allows the data to be discarded,
> magically unclaging the box).
>
> In the absence of the reporter wanting to dump the kernel's
> core, how about a patch to print the Sack datastructure when
> the command to disable Sack is received by the kernel?
> Maybe just print the last 16b of the IP address?

Given the fact that I've had this problem for so long, over a variety
of networking hardware vendors and colo-facilities, this really sounds
good to me. It will be challenging for me to justify a kernel core
dump, but a simple patch to dump the Sack data would be do-able.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/