Re: Major network performance regression in 3.7

From: Eric Dumazet
Date: Sat Jan 05 2013 - 18:18:39 EST


On Sat, 2013-01-05 at 22:49 +0100, Willy Tarreau wrote:
> Hi,
>
> I'm observing multiple apparently unrelated network performance
> issues in 3.7, to the point that I doubt they all come from the
> network stack.
>
> My setup involves 3 machines connected point-to-point with Myri
> 10GE NICs (the middle machine has 2 NICs). The middle machine
> normally runs haproxy; the other two run an HTTP load generator
> and a dummy web server:
>
>
> [ client ] <--------> [ haproxy ] <--------> [ server ]
>
> Usually, transferring moderately large HTTP objects from the server
> to the client via haproxy poses no problem at 10 Gbps.
>
> This time I observed that it was not possible to go beyond 6.8 Gbps,
> with the whole chain idling a lot. I tried changing the IRQ rate, CPU
> affinity, tcp_rmem/tcp_wmem, disabling flow control, etc. (the usual
> knobs), but nothing went beyond that.
>
> So I removed haproxy from the equation and simply started the client
> on the middle machine. Same issue. I thought about concurrency issues,
> so I reduced the test to a single connection, and nothing changed
> (usually I achieve 10G even with a single connection, given large
> enough TCP windows). When I started tcpdump, the transfer immediately
> stalled and did not come back after I stopped tcpdump. This was
> reproducible several times, but not always.
>
> So I first thought about an issue in the myri10ge driver and wanted to
> confirm that everything was OK on the middle machine.
>
> I started the server on it and aimed the client at it via the loopback.
> The transfer rate was even worse: randomly oscillating between 10 and
> 100 MB/s! Normally on the loopback, I get several GB/s here.
>
> Running tcpdump on the loopback showed me several very concerning issues:
>
> 1) Lots of packets are lost before reaching tcpdump. The trace shows that
> these segments are ACKed, so they're correctly received, but tcpdump
> does not get them. Tcpdump's stats at the end report impressive numbers:
> around 90% of packets dropped from the capture!
>
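One thing worth ruling out for the capture drops is the default capture
buffer size: at these rates a small kernel buffer overflows long before
tcpdump can drain it, even with a healthy stack. A minimal libpcap sketch
for retrying with a much larger buffer (assuming libpcap >= 1.0; the
256 MB size, the packet count and the "lo" device are arbitrary choices
for illustration, not values from the report):

#include <pcap/pcap.h>
#include <stdio.h>

int main(void)
{
	char errbuf[PCAP_ERRBUF_SIZE];
	struct pcap_stat st;
	struct pcap_pkthdr *hdr;
	const u_char *data;
	pcap_t *p = pcap_create("lo", errbuf);
	int i;

	if (!p) {
		fprintf(stderr, "pcap_create: %s\n", errbuf);
		return 1;
	}
	pcap_set_snaplen(p, 96);            /* headers are enough here */
	pcap_set_timeout(p, 100);           /* ms */
	pcap_set_buffer_size(p, 256 << 20); /* much larger kernel buffer */
	if (pcap_activate(p) < 0) {
		fprintf(stderr, "pcap_activate: %s\n", pcap_geterr(p));
		return 1;
	}
	for (i = 0; i < 1000000; i++)       /* capture for a while */
		if (pcap_next_ex(p, &hdr, &data) < 0)
			break;
	if (pcap_stats(p, &st) == 0)        /* then compare recv vs drop */
		printf("recv=%u drop=%u ifdrop=%u\n",
		       st.ps_recv, st.ps_drop, st.ps_ifdrop);
	pcap_close(p);
	return 0;
}

If the installed tcpdump is recent enough, its -B option (capture buffer
size in KiB) is a quicker way to run the same experiment.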
> 2) ACKs seem to be delivered immediately but do not trigger sending; the
> system seems to be running with delayed ACKs, as it waits 40 or 200 ms
> before restarting. This is visible even in the first round trips (a
> quick counter-test is sketched after the trace below):
>
> - connection setup:
>
> 18:32:08.071602 IP 127.0.0.1.26792 > 127.0.0.1.8000: S 2036886615:2036886615(0) win 8030 <mss 65495,nop,nop,sackOK,nop,wscale 9>
> 18:32:08.071605 IP 127.0.0.1.8000 > 127.0.0.1.26792: S 126397113:126397113(0) ack 2036886616 win 8030 <mss 65495,nop,nop,sackOK,nop,wscale 9>
> 18:32:08.071614 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126397114 win 16
>
> - GET /?s=1g HTTP/1.0
>
> 18:32:08.071649 IP 127.0.0.1.26792 > 127.0.0.1.8000: P 2036886616:2036886738(122) ack 126397114 win 16
>
> - HTTP/1.1 200 OK with the beginning of the response:
>
> 18:32:08.071672 IP 127.0.0.1.8000 > 127.0.0.1.26792: . 126397114:126401210(4096) ack 2036886738 win 16
> 18:32:08.071676 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126401210 win 250
> ==> 200ms pause here
> 18:32:08.275493 IP 127.0.0.1.8000 > 127.0.0.1.26792: P 126401210:126463006(61796) ack 2036886738 win 16
> ==> 40ms pause here
> 18:32:08.315493 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126463006 win 256
> 18:32:08.315498 IP 127.0.0.1.8000 > 127.0.0.1.26792: . 126463006:126527006(64000) ack 2036886738 win 16
>
> ... and so on
>
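On the delayed-ACK pauses above: a cheap experiment on the receiving
side is TCP_QUICKACK. It is a one-shot hint rather than a sticky option,
so it has to be re-armed around each read. A minimal sketch (the wrapper
name and the plain read() are illustrative, not the actual client code):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

static ssize_t read_quickack(int fd, void *buf, size_t len)
{
	int one = 1;
	ssize_t n = read(fd, buf, len);

	/* ask the stack to ACK immediately instead of waiting for the
	 * delayed-ACK timer (the 40/200 ms pauses in the trace above) */
	setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
	return n;
}

If the 40/200 ms gaps disappear with this, delayed ACKs are at least
part of the picture.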
> My server uses splice() with the SPLICE_F_MORE flag to send data.
> I noticed that when I don't use splice() and rely on send() with
> MSG_MORE instead, I don't get the issue.
>
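For reference, a splice()-based send path of that shape looks roughly
like the sketch below; it assumes a file-backed source and a 64 kB chunk
size, and the function and descriptor names are placeholders, not the
actual server code:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* file -> pipe -> socket, with SPLICE_F_MORE on both hops so the
 * stack knows more data is coming */
static int splice_file_to_sock(int file_fd, int sock_fd, size_t todo)
{
	int pfd[2];
	ssize_t in, out;

	if (pipe(pfd) < 0)
		return -1;
	while (todo) {
		in = splice(file_fd, NULL, pfd[1], NULL,
			    todo > 65536 ? 65536 : todo, SPLICE_F_MORE);
		if (in <= 0)
			break;
		todo -= in;
		while (in) {
			out = splice(pfd[0], NULL, sock_fd, NULL, in,
				     SPLICE_F_MOVE | SPLICE_F_MORE);
			if (out <= 0)
				goto done;
			in -= out;
		}
	}
done:
	close(pfd[0]);
	close(pfd[1]);
	return todo ? -1 : 0;
}

The send(MSG_MORE) variant replaces the two splice() calls with a plain
read()+send() loop; the fact that only the splice() path misbehaves is a
useful hint in itself.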
> 3) I wondered if this had something to do with the 64k MTU on the loopback,
> so I lowered it to 16 kB (a programmatic way to do this is sketched after
> the trace below). The performance was even worse (about 5 MB/s). Starting
> tcpdump managed to make the transfer stall, just like with the myri10ge.
> In this last test, I noticed that there were some real drops, because
> SACKs appeared:
>
> 18:45:16.699951 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 956153186:956169530(16344) ack 131668746 win 16
> 18:45:16.699956 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 956169530 win 64
> 18:45:16.904119 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957035762:957052106(16344) ack 131668746 win 16
> 18:45:16.904122 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957052106 win 703
> 18:45:16.904124 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957052106:957099566(47460) ack 131668746 win 16
> 18:45:17.108117 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957402550:957418894(16344) ack 131668746 win 16
> 18:45:17.108119 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957418894 win 1846
> 18:45:17.312115 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957672806:957689150(16344) ack 131668746 win 16
> 18:45:17.312117 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957689150 win 2902
> 18:45:17.516114 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 958962966:958979310(16344) ack 131668746 win 16
> 18:45:17.516116 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 958979310 win 7941
> 18:45:17.516150 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 959503678 win 9926 <nop,nop,sack 1 {959405614:959421958}>
> 18:45:17.516151 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 959503678 win 9926 <nop,nop,sack 1 {959421958:959438302}>
>
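Regarding the MTU change in 3): for anyone reproducing this, the loopback
MTU can be lowered with "ip link set dev lo mtu 16384", or programmatically
via the SIOCSIFMTU ioctl. A minimal sketch of the latter (device name and
value are just the ones from the test):

#include <net/if.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* e.g. set_mtu("lo", 16384) to drop from the 64k loopback default */
static int set_mtu(const char *dev, int mtu)
{
	struct ifreq ifr;
	int ret, fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, dev, IFNAMSIZ - 1);
	ifr.ifr_mtu = mtu;
	ret = ioctl(fd, SIOCSIFMTU, &ifr);
	close(fd);
	return ret;
}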
> Please note that the Myri card is running with the normal MTU of 1500;
> jumbo frames were not used.
>
> From what I could test, only the Tx path seems affected, because after
> rebooting the server on 3.5.7, I can again transfer at 10 Gbps via the
> myri10ge to the 3.7.1 client. I tried disabling GSO, TSO, etc., but
> nothing helped. Also, on 3.5.7, there are almost no drops between the
> kernel and tcpdump.
>
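On the GSO/TSO toggling: that is normally done with "ethtool -K <dev>
gso off" (and "tso off"), but the same switch can be flipped from a test
program through the SIOCETHTOOL ioctl. A minimal sketch for GSO (the
device name is a placeholder; TSO follows the same pattern with
ETHTOOL_STSO):

#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <net/if.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* e.g. set_gso("eth2", 0) to disable generic segmentation offload */
static int set_gso(const char *dev, int on)
{
	struct ethtool_value ev = { .cmd = ETHTOOL_SGSO, .data = on };
	struct ifreq ifr;
	int ret, fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;
	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, dev, IFNAMSIZ - 1);
	ifr.ifr_data = (char *)&ev;
	ret = ioctl(fd, SIOCETHTOOL, &ifr);
	close(fd);
	return ret;
}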
> I really suspect that all these issues are related. Maybe something about
> send buffer recycling, I don't know. The fact that tcpdump gets very few
> packets also makes me think of something memory-allocation related. And
> that could perhaps explain why starting it causes the traffic to stall.
>
> Sadly, it will be hard to bisect this bug because I had a hard time running
> 3.7-rc up to rc5 due to several bugs that affected this workload (and were
> fortunately fixed since).
>
> I think it's better to focus only on the loopback issue, since it's
> easier to reproduce and involves fewer moving parts.
>
> I've just tried removing the SPLICE_F_MORE flag on the sender, but it
> does not change anything.
>
> If anyone has ideas for things to test to narrow the problem down, I'm
> interested. I have not tried 3.8-rc2 yet, and from reading the archives
> there do not seem to be any similar reports. Note that I've been using
> 3.5.7 as a fallback, but the same lab ran 3.6.6 without any issues two
> months ago, so I'm really confident that this is a 3.7 regression.
>
> Regards,
> Willy
>
> --

Hi Willy, another good finding during the weekend! ;)

1) This looks like interrupts being spread over multiple CPUs, which
gives out-of-order problems in the TCP stack.
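An easy way to test this first hypothesis is to pin all of the NIC's IRQ
vectors to a single CPU and re-run the transfer. A rough sketch (the
vector numbers must be taken from /proc/interrupts; 64 below is only a
placeholder):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* e.g. pin_irq(64, 1) pins IRQ 64 to CPU0, the same as
 * "echo 1 > /proc/irq/64/smp_affinity" */
static int pin_irq(int irq, unsigned int cpu_mask)
{
	char path[64], mask[16];
	int fd, len, ret;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
	len = snprintf(mask, sizeof(mask), "%x\n", cpu_mask);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	ret = (write(fd, mask, len) == (ssize_t)len) ? 0 : -1;
	close(fd);
	return ret;
}

If the out-of-order pattern disappears with everything pinned to one CPU,
that would point at the multi-queue/affinity side.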

2) Another possibility would be that the Myri card/driver doesn't cope
very well with high-order pages.




