Re: Bug report: UDP ~20% degradation

From: Linux regression tracking #adding (Thorsten Leemhuis)
Date: Fri Feb 10 2023 - 13:38:10 EST


[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few templates
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 08.02.23 12:08, Tariq Toukan wrote:
>
> Our performance verification team spotted a degradation of up to ~20% in
> UDP performance, for a specific combination of parameters.
>
> Our matrix covers several parameters values, like:
> IP version: 4/6
> MTU: 1500/9000
> Msg size: 64/1452/8952 (only when applicable while avoiding ip
> fragmentation).
> Num of streams: 1/8/16/24.
> Num of directions: unidir/bidir.
>
> Surprisingly, the issue exists only with this specific combination:
> 8 streams,
> MTU 9000,
> Msg size 8952,
> both ipv4/6,
> bidir.
> (in unidir it repros only with ipv4)
>
> The reproduction is consistent on all the different setups we tested with.
>
> Bisect [2] was done between these two points, v5.19 (Good), and v6.0-rc1
> (Bad), with ConnectX-6DX NIC.
>
> c82a69629c53eda5233f13fc11c3c01585ef48a2 is the first bad commit [1].
>
> We couldn't come up with a good explanation how this patch causes this
> issue. We also looked for related changes in the networking/UDP stack,
> but nothing looked suspicious.
>
> Maybe someone here can help with this.
> We can provide more details or do further tests/experiments to progress
> with the debug.

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced c82a69629c53eda5233f13fc11c3c01585ef48a
#regzbot title sched/fair: UDP ~20% degradation
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

> [1]
> commit c82a69629c53eda5233f13fc11c3c01585ef48a2
> Author: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> Date:   Fri Jul 8 17:44:01 2022 +0200
>
>     sched/fair: fix case with reduced capacity CPU
>
>     The capacity of the CPU available for CFS tasks can be reduced
> because of
>     other activities running on the latter. In such case, it's worth
> trying to
>     move CFS tasks on a CPU with more available capacity.
>
>
>
>
>     The rework of the load balance has filtered the case when the CPU is
>
>     classified to be fully busy but its capacity is reduced.
>
>
>
>
>
>
>     Check if CPU's capacity is reduced while gathering load balance
> statistic
>
>     and classify it group_misfit_task instead of group_fully_busy so we can
>
>     try to move the load on another CPU.
>
>
>
>
>
>
>     Reported-by: David Chen <david.chen@xxxxxxxxxxx>
>
>
>     Reported-by: Zhang Qiao <zhangqiao22@xxxxxxxxxx>
>
>
>     Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
>
>
>     Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
>
>
>     Tested-by: David Chen <david.chen@xxxxxxxxxxx>
>
>
>     Tested-by: Zhang Qiao <zhangqiao22@xxxxxxxxxx>
>
>
>     Link:
> https://lkml.kernel.org/r/20220708154401.21411-1-vincent.guittot@xxxxxxxxxx
>
>
>
> [2]
>
> Detailed bisec steps:
>
> +--------------+--------+-----------+-----------+
> | Commit       | Status | BW (Gbps) | BW (Gbps) |
> |              |        | run1      | run2      |
> +--------------+--------+-----------+-----------+
> | 526942b8134c | Bad    | ---       | ---       |
> +--------------+--------+-----------+-----------+
> | 2e7a95156d64 | Bad    | ---       | ---       |
> +--------------+--------+-----------+-----------+
> | 26c350fe7ae0 | Good   | 279.8     | 281.9     |
> +--------------+--------+-----------+-----------+
> | 9de1f9c8ca51 | Bad    | 257.243   | ---       |
> +--------------+--------+-----------+-----------+
> | 892f7237b3ff | Good   | 285       | 300.7     |
> +--------------+--------+-----------+-----------+
> | 0dd1cabe8a4a | Good   | 305.599   | 290.3     |
> +--------------+--------+-----------+-----------+
> | dfea84827f7e | Bad    | 250.2     | 258.899   |
> +--------------+--------+-----------+-----------+
> | 22a39c3d8693 | Bad    | 236.8     | 245.399   |
> +--------------+--------+-----------+-----------+
> | e2f3e35f1f5a | Good   | 277.599   | 287       |
> +--------------+--------+-----------+-----------+
> | 401e4963bf45 | Bad    | 250.149   | 248.899   |
> +--------------+--------+-----------+-----------+
> | 3e8c6c9aac42 | Good   | 299.09    | 294.9     |
> +--------------+--------+-----------+-----------+
> | 1fcf54deb767 | Good   | 292.719   | 301.299   |
> +--------------+--------+-----------+-----------+
> | c82a69629c53 | Bad    | 254.7     | 246.1     |
> +--------------+--------+-----------+-----------+
> | c02d5546ea34 | Good   | 276.4     | 294       |
> +--------------+--------+-----------+-----------+