Re: TCP connection issues against Amazon S3

From: Yuchung Cheng
Date: Tue Jan 06 2015 - 13:34:04 EST


On Tue, Jan 6, 2015 at 10:17 AM, Erik Grinaker <erik@xxxxxxxxxx> wrote:
>
>> On 06 Jan 2015, at 17:20, Eric Dumazet <eric.dumazet@xxxxxxxxx> wrote:
>> On Tue, 2015-01-06 at 16:11 +0000, Erik Grinaker wrote:
>>>> On 06 Jan 2015, at 16:04, Eric Dumazet <eric.dumazet@xxxxxxxxx> wrote:
>>>> On Tue, 2015-01-06 at 15:14 +0000, Erik Grinaker wrote:
>>>>> (CCing Yuchung, as his name comes up in the relevant commits)
>>>>>
>>>>> After upgrading from Ubuntu 12.04.5 to 14.04.1 we have begun seeing
>>>>> intermittent TCP connection hangs for HTTP image requests against
>>>>> Amazon S3. 3-5% of requests will suddenly stall in the middle of the
>>>>> transfer before timing out. We see this problem across a range of
>>>>> servers, in several data centres and networks, all located in Norway.
>>>>>
>>>>> A packet dump [1] shows repeated ACK retransmits for some of the
TCP does not retransmit ACK ... do you mean DUPACKs sent by the receiver?

I am trying to understand the problem. Could you confirm that it's the
HTTP responses sent from Amazon S3 got stalled, or HTTP requests sent
from the receiver (your host)?

btw I suspect some middleboxes are stripping SACKOK options from your
SYNs (or Amazon SYN-ACKs) assuming Amazon supports SACK.



>>>>> requests. Using Ubuntu mainline kernels, we found the problem to have
>>>>> been introduced between 3.11.10 and 3.12.0, possibly in
>>>>> 0f7cc9a3c2bd89b15720dbf358e9b9e62af27126. The problem is also present
>>>>> in 3.18.1. Disabling tcp_window_scaling seems to solve it, but has
>>>>> obvious drawbacks for transfer speeds. Other sysctls do not seem to
>>>>> affect it.
>>>>>
>>>>> I am not sure if this is fundamentally a kernel bug or a network
>>>>> issue, but we did not see this problem with older kernels.
>>>>>
>>>>> [1] http://abstrakt.bengler.no/tcp-issues-s3.pcap.bz2
>>>>
>>>>
>>>> CC netdev
>>>>
>>>> This looks like the bug we fixed here :
>>>>
>>>> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=39bb5e62867de82b269b07df900165029b928359
>>>
>>> Has that patch gone into a release? Because the problem persists with 3.18.1.
>>
>> Patch is in 3.18.1 yes.
>>
>> So thats a separate issue.
>>
>> Can you confirm pcap was taken at receiver (195.159.221.106), not sender
>> (54.231.136.74) , and on which host is running the 'buggy kernel' ?
>
> Yes, pcap was taken on receiver (195.159.221.106).
>
>> If the sender is broken, changing the kernel on receiver wont help.
>>
>> BTW not using sack (on 54.231.132.98) is terrible for performance in
>> lossy environments.
>
> It may well be that the sender is broken; however, the sender is Amazon S3, so I do not have any control over it. And in any case, the problem goes away with 3.11.10 on receiver, but persists with 3.12.0 (or later) on receiver, so there must be some change in 3.12.0 which has caused this to trigger.
>
> If you are confident that the problem is with Amazon, I can get in touch with their engineering department.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/