Major network performance regression in 3.7

From: Willy Tarreau
Date: Sat Jan 05 2013 - 16:49:55 EST


Hi,

I'm observing multiple, apparently unrelated network performance
issues in 3.7, to the point that I doubt the root cause even lies
in the network stack.

My setup involves 3 machines connected point-to-point with Myri
10GE NICs (the middle machine has two NICs). The middle machine
normally runs haproxy; the other two run either an HTTP load
generator or a dummy web server:


[ client ] <--------> [ haproxy ] <--------> [ server ]

Usually, transferring moderately large HTTP objects from the server
to the client via haproxy sustains 10 Gbps without any problem.

This time I observed that it was not possible to go beyond 6.8 Gbps,
with the whole chain idling a lot. I tried changing the IRQ rate, CPU
affinity, tcp_rmem/tcp_wmem, disabling flow control, etc. (the usual
knobs), but nothing got past that limit.

So I removed haproxy from the equation and simply started the client
on the middle machine. Same issue. I thought about concurrency issues,
so I reduced the test to a single connection, and nothing changed
(usually I achieve 10G even over a single connection, provided the TCP
windows are large enough). I tried to start tcpdump and the transfer
immediately stalled, and it did not come back after I stopped tcpdump.
This was reproducible several times, though not always.
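
For reference, "large enough TCP windows" here just means large
per-socket buffers requested before connect(); a rough sketch of the
kind of setup I mean, with purely indicative sizes:

#include <sys/socket.h>
#include <netinet/in.h>

/* sketch only: request large socket buffers before connect() so that
 * window scaling can take them into account */
int make_test_socket(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int buf = 4 * 1024 * 1024;    /* 4 MB, indicative value only */

    if (fd < 0)
        return -1;

    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &buf, sizeof(buf));
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &buf, sizeof(buf));
    return fd;
}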

So I first thought about an issue in the myri10ge driver and wanted to
confirm that everything was OK on the middle machine.

I started the server on it and aimed the client at it over the loopback.
The transfer rate was even worse: it randomly oscillated between 10 and
100 MB/s! Normally I get several GB/s over the loopback here.

Running tcpdump on the loopback showed me several very concerning issues:

1) Lots of packets are lost before reaching tcpdump. The trace shows that
these segments are ACKed, so they are correctly received, but tcpdump
does not get them. Tcpdump's stats at the end report impressive numbers:
around 90% of packets dropped from the capture!

2) ACKs seem to be delivered immediately but do not trigger sending; the
system behaves as if it were running with delayed ACKs, waiting 40 or
200 ms before restarting, and this is visible even in the first round trips:

- connection setup :

18:32:08.071602 IP 127.0.0.1.26792 > 127.0.0.1.8000: S 2036886615:2036886615(0) win 8030 <mss 65495,nop,nop,sackOK,nop,wscale 9>
18:32:08.071605 IP 127.0.0.1.8000 > 127.0.0.1.26792: S 126397113:126397113(0) ack 2036886616 win 8030 <mss 65495,nop,nop,sackOK,nop,wscale 9>
18:32:08.071614 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126397114 win 16

- GET /?s=1g HTTP/1.0

18:32:08.071649 IP 127.0.0.1.26792 > 127.0.0.1.8000: P 2036886616:2036886738(122) ack 126397114 win 16

- HTTP/1.1 200 OK with the beginning of the response :

18:32:08.071672 IP 127.0.0.1.8000 > 127.0.0.1.26792: . 126397114:126401210(4096) ack 2036886738 win 16
18:32:08.071676 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126401210 win 250
==> 200ms pause here
18:32:08.275493 IP 127.0.0.1.8000 > 127.0.0.1.26792: P 126401210:126463006(61796) ack 2036886738 win 16
==> 40ms pause here
18:32:08.315493 IP 127.0.0.1.26792 > 127.0.0.1.8000: . ack 126463006 win 256
18:32:08.315498 IP 127.0.0.1.8000 > 127.0.0.1.26792: . 126463006:126527006(64000) ack 2036886738 win 16

... and so on
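
For what it's worth, one way to check whether delayed ACKs really are
what delays the restart would be to force quick ACKs on the receiving
socket. This is only a diagnostic sketch which I have not tried yet
(TCP_QUICKACK is cleared again by the stack, so it has to be re-armed
after each read):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* diagnostic read loop: re-arm TCP_QUICKACK after every read since
 * the stack clears it on its own */
static long drain_quickack(int fd)
{
    char buf[65536];
    long total = 0;
    ssize_t n;
    int one = 1;

    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        total += n;
        setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
    }
    return total;
}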

My server is using splice() with the SPLICE_F_MORE flag to send data.
I noticed that when I don't use splice() and rely on send() with
MSG_MORE instead, I don't get the issue.
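
To make the difference between the two paths clear, they boil down to
roughly the following (a simplified sketch, not the server's actual
code; error handling and partial writes are omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <sys/socket.h>

/* path that shows the problem: data already sitting in a pipe is
 * pushed to the socket with splice() and SPLICE_F_MORE */
static ssize_t send_via_splice(int sock, int pipe_rd, size_t len)
{
    return splice(pipe_rd, NULL, sock, NULL, len, SPLICE_F_MORE);
}

/* path that does not show the problem: plain copy through send()
 * with MSG_MORE */
static ssize_t send_via_copy(int sock, const void *buf, size_t len)
{
    return send(sock, buf, len, MSG_MORE);
}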

3) I wondered whether this had something to do with the 64 kB MTU on
the loopback, so I lowered it to 16 kB (the change itself is trivial;
a C equivalent is sketched after the trace below, for reference). The
performance was even worse (about 5 MB/s). Starting tcpdump managed to
make my transfer stall, just like with the myri10ge. In this last test
I noticed that there were some real drops, because there were some
SACKs:

18:45:16.699951 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 956153186:956169530(16344) ack 131668746 win 16
18:45:16.699956 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 956169530 win 64
18:45:16.904119 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957035762:957052106(16344) ack 131668746 win 16
18:45:16.904122 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957052106 win 703
18:45:16.904124 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957052106:957099566(47460) ack 131668746 win 16
18:45:17.108117 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957402550:957418894(16344) ack 131668746 win 16
18:45:17.108119 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957418894 win 1846
18:45:17.312115 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 957672806:957689150(16344) ack 131668746 win 16
18:45:17.312117 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 957689150 win 2902
18:45:17.516114 IP 127.0.0.1.8000 > 127.0.0.1.8002: P 958962966:958979310(16344) ack 131668746 win 16
18:45:17.516116 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 958979310 win 7941
18:45:17.516150 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 959503678 win 9926 <nop,nop,sack 1 {959405614:959421958}>
18:45:17.516151 IP 127.0.0.1.8002 > 127.0.0.1.8000: . ack 959503678 win 9926 <nop,nop,sack 1 {959421958:959438302}>
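
As an aside, the loopback MTU change mentioned in 3) above is nothing
exotic; from C it boils down to the equivalent of what ip/ifconfig do,
roughly:

#include <net/if.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* equivalent of "ip link set lo mtu 16384" */
static int set_lo_mtu(int mtu)
{
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int ret;

    if (fd < 0)
        return -1;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "lo", IFNAMSIZ - 1);
    ifr.ifr_mtu = mtu;
    ret = ioctl(fd, SIOCSIFMTU, &ifr);
    close(fd);
    return ret;
}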

Please note that the Myri card is running with the normal MTU of 1500;
jumbo frames were not used.

From what I could test, only the Tx path seems affected, because after
rebooting the server on 3.5.7, I can again transfer at 10 Gbps via the
myri10ge to the 3.7.1 client. I tried to disable GSO, TSO, etc., but
nothing helped. Also, on 3.5.7, there are almost no drops between the
kernel and tcpdump.

I really suspect that all these issues are related. Maybe something
about send buffer recycling, I don't know. The fact that tcpdump gets
very few packets also makes me think of something around memory
allocation, and that could perhaps explain why starting it causes the
traffic to stall.

Sadly, this bug will be hard to bisect, because I had a hard time
running the 3.7-rc kernels up to rc5 due to several other bugs that
affected this workload (and have fortunately been fixed since).

I think it's better to focus only on the loopback issue, since it's
easier to reproduce and involves fewer moving parts.

I've just tried removing the SPLICE_F_MORE flag on the sender, but it
did not change anything.

If anyone has ideas of things to test to narrow the problem down, I'm
interested. I have not tried 3.8-rc2 yet, and from reading the archives
there do not seem to be any similar reports. Note that I've been using
3.5.7 as a fallback, but the same lab ran 3.6.6 without any issue two
months ago, so I'm really confident that this is a 3.7 regression.

Regards,
Willy
