On Tue, Apr 1, 2008 at 7:59 PM, Gabriel Barazer <gabriel@xxxxxxxx> wrote:
On 04/01/2008 7:17:31 PM +0200, Leo <neleo@xxxxxxx> wrote:
> H. Willstrand wrote:
>> On Tue, Apr 1, 2008 at 5:43 PM, Gabriel Barazer <gabriel@xxxxxxxx> wrote:
>>
>>> On 04/01/2008 4:43:20 PM +0200, Brett Paden <paden@xxxxxxxxxxxx> wrote:
>>> >> If I'm right, Brett's problem lies in the test client (provided in
>>> >> the first mail). This probably has to do with the number of ports
>>> >> opened and closed during a short time period.
>>> >
>>> > My test client is designed to simulate the sort of load our production
>>> > databases and web servers see. We're talking on the order of 100-400
>>> > connections per second. On an unloaded server the 3000ms delays occur
>>> > right around 400 connections a second, but we have seen them at lower
>>> > connection rates. Are you suggesting that we could do something simple
>>> > (like reap TIME_WAIT connections) to alleviate the problem?
>>>
>>> Using tcp_tw_recycle / tcp_tw_reuse doesn't solve the problem on either
>>> the client or the server. I tested with and without these options
>>> enabled and also disabled netfilter's connection tracking; none of this
>>> removed the delay. If even the "lo" interface is affected, the problem
>>> is definitely in the network stack and not in the device drivers.
>>>
>>> Here is a thread I started on LKML about this very same bug.
>>> http://lkml.org/lkml/2008/3/14/353
>>> There is a forum thread with French hosting providers talking about it
>>> (if some of you read French:
>>> http://www.webmasterclub.fr/forum/topic,59486,0.html).
>>>
>>> We are far from being alone!
>>>
> Welcome to the club, Gabriel!
>>> Gabriel
How lucky I am!
I suspect there are many other people out there with this problem. They
just don't notice these delays on small infrastructures, because this bug
doesn't actually cause a connection error, "only" an unacceptable delay
on moderately to heavily loaded servers.
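
For anyone who wants to check whether their own servers are affected, a
minimal sketch of a connect-time probe follows. It is not the test client
from the first mail; the function name, the 2.5 second threshold and the
10 second timeout are assumptions chosen for illustration. It only times
the TCP handshake and reports the attempts that took suspiciously long.

```python
import socket
import time


def time_connects(host, port, attempts, threshold=2.5, timeout=10.0):
    """Open and close `attempts` TCP connections one after another.

    Returns (latencies, slow): every connect time in seconds, and the
    subset that took at least `threshold` seconds (hypothetical cut-off
    just below the 3000ms delay discussed in this thread).
    """
    latencies = []
    for _ in range(attempts):
        start = time.monotonic()
        # create_connection returns once the three-way handshake is done,
        # so the elapsed time covers any retransmitted SYN.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        latencies.append(time.monotonic() - start)
    slow = [t for t in latencies if t >= threshold]
    return latencies, slow
```

Calling time_connects("db.example.invalid", 3306, 1000) against a real
server (the host name here is a placeholder) and looking at len(slow)
gives a rough idea of how often the delay hits. Note that this loop is
sequential, so it will not reach several hundred connections per second
unless it is run from several processes at once.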
>> Ok, seems to be the same issue that Leo has (has nothing to do with
>> the Brett / Marlon issue, the only common denominator is the 3000ms).
>>
> But Gabriel is also talking about 3 second timeouts on the client as
> Brett and I did. I have read Gabriel's description on the provided link
> and it seems to be exactly the same problem. I think Brett can confirm
> this ...
>> This issue is probably caused by the server delivering a miscalculated
>> SYN/ACK (the acked number is miscalculated, see my second mail).
>>
> When you look at my first tcpdump with two machines as server and client
> then you can see that there are no miscalculated SYN/ACK packets from
> the server (and therefore no RST packet from the client). All packets
> have the right number but the client never receives the SYN/ACK packet
> from the server. Only in the lo test are there RST packets and wrong
> packet numbers. But as I told you in my last email, I think this is a
> different problem and not important for us. We should ignore the lo test
> and concentrate on the "real" problem of Brett, Gabriel and myself (and
> a lot of other people out there).
I confirm that there is no problem with the sequence numbers. Attached is
the pcap-compatible capture of the relevant packets (608 bytes, 6
packets total: 2 for the failed handshake, 3 for the successful one and
1 for the first mysql data packet). This capture was taken in promiscuous
mode and has been filtered to show only the relevant packets.
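
For those who prefer to check a capture like this without opening it in a
viewer, here is a rough sketch that scans a classic pcap file for repeated
SYN packets and reports the gap between the first SYN and each repeat. It
assumes an Ethernet link type, IPv4 and microsecond timestamps, and it
assumes the delay shows up as a retransmitted SYN with the same sequence
number; the function name is made up for this example and nothing here
has been run against the attached file.

```python
import socket
import struct


def find_syn_retransmits(pcap_bytes):
    """Return a list of (flow, gap_seconds) for every repeated SYN.

    flow is (src_ip, src_port, dst_ip, dst_port, seq); the gap is measured
    from the first SYN seen for that flow.
    """
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"
    else:
        raise ValueError("not a classic microsecond pcap file")
    linktype = struct.unpack(endian + "I", pcap_bytes[20:24])[0]
    if linktype != 1:
        raise ValueError("only Ethernet captures are handled in this sketch")

    first_seen = {}
    retransmits = []
    offset = 24  # skip the global header
    while offset + 16 <= len(pcap_bytes):
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", pcap_bytes[offset:offset + 16])
        frame = pcap_bytes[offset + 16:offset + 16 + incl_len]
        offset += 16 + incl_len
        # Need Ethernet (14 bytes) plus a minimal IPv4 header (20 bytes).
        if len(frame) < 34 or frame[12:14] != b"\x08\x00":
            continue
        ip = frame[14:]
        ihl = (ip[0] & 0x0F) * 4
        if ip[9] != 6 or len(ip) < ihl + 20:  # protocol 6 is TCP
            continue
        tcp = ip[ihl:]
        sport, dport, seq = struct.unpack("!HHI", tcp[:8])
        if tcp[13] & 0x12 != 0x02:  # keep SYN set, ACK clear
            continue
        flow = (socket.inet_ntoa(ip[12:16]), sport,
                socket.inet_ntoa(ip[16:20]), dport, seq)
        now = ts_sec + ts_usec / 1e6
        if flow in first_seen:
            retransmits.append((flow, now - first_seen[flow]))
        else:
            first_seen[flow] = now
    return retransmits
```

Reading the file with open(path, "rb").read() and passing the bytes in
should list any flow whose SYN was sent more than once; a gap close to
3 seconds would match the delay described in this thread.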
I'm missing the tcpdump...
Attachment:
tcp-3sec-bug.pcap
Description: Binary data