Re: [PATCH v4 2/2] gro: optimise redundant parsing of packets

From: Richard Gobert
Date: Wed Mar 22 2023 - 15:33:56 EST


> On Wed, Mar 22, 2023 at 2:59 AM Paolo Abeni <pabeni@xxxxxxxxxx> wrote:
> >
> > On Mon, 2023-03-20 at 18:00 +0100, Richard Gobert wrote:
> > > Currently the IPv6 extension headers are parsed twice: first in
> > > ipv6_gro_receive, and then again in ipv6_gro_complete.
> > >
> > > By using the new ->transport_proto field, and also storing the size of the
> > > network header, we can avoid parsing extension headers a second time in
> > > ipv6_gro_complete (which saves multiple memory dereferences and conditional
> > > checks inside ipv6_exthdrs_len for a varying amount of extension headers in
> > > IPv6 packets).
> > >
> > > The implementation had to handle both inner and outer layers in case of
> > > encapsulation (as they can't use the same field). I've applied a similar
> > > optimisation to Ethernet.
> > >
> > > Performance tests for TCP stream over IPv6 with a varying amount of
> > > extension headers demonstrate throughput improvement of ~0.7%.
> >
> > I'm surprised that the improvement is measurable: for large aggregate
> > packets a single ipv6_exthdrs_len() call is avoided out of tens calls
> > for the individual pkts. Additionally such figure is comparable to
> > noise level in my tests.

It's not simple but I made an effort to make a quiet environment.
Correct configuration allows for this kind of measurements to be made
as the test is CPU bound and noise is a variance that can be reduced with
enough samples.

Environment example: (100Gbit NIC (mlx5), physical machine, i9 12th
gen)

# power-management and hyperthreading disabled in BIOS
# sysctl preallocate net mem
echo 0 > /sys/devices/system/cpu/cpufreq/boost # disable turboboost
ethtool -A enp1s0f0np0 rx off tx off autoneg off # no PAUSE frames

# Single core performance
for x in /sys/devices/system/cpu/cpu[1-9]*/online; do echo 0 >"$x"; done

./network-testing-master/bin/netfilter_unload_modules.sh 2>/dev/null # unload netfilter
tuned-adm profile latency-performance
cpupower frequency-set -f 2200MHz # Set core to specific frequency
systemctl isolate rescue-ssh.target
# and kill all processes besides init

> > This adds a couple of additional branches for the common (no extensions
> > header) case.

The additional branch in ipv6_gro_receive would be negligible or even
non-existent for a branch predictor in the common case
(non-encapsulated packets).
I could wrap it with a likely macro if you wish.
Inside ipv6_gro_complete a couple of branches are saved for the common
case as demonstrated below.

original code ipv6_gro_complete (ipv6_exthdrs_len is inlined):

// if (skb->encapsulation)

ffffffff81c4962b: f6 87 81 00 00 00 20 testb $0x20,0x81(%rdi)
ffffffff81c49632: 74 2a je ffffffff81c4965e <ipv6_gro_complete+0x3e>

...

// nhoff += sizeof(*iph) + ipv6_exthdrs_len(iph, &ops);

ffffffff81c4969c: eb 1b jmp ffffffff81c496b9 <ipv6_gro_complete+0x99> <-- jump to beginning of for loop
ffffffff81c4968e: b8 28 00 00 00 mov $0x28,%eax
ffffffff81c49693: 31 f6 xor %esi,%esi
ffffffff81c49695: 48 c7 c7 c0 28 aa 82 mov $0xffffffff82aa28c0,%rdi
ffffffff81c4969c: eb 1b jmp ffffffff81c496b9 <ipv6_gro_complete+0x99>
ffffffff81c4969e: f6 41 18 01 testb $0x1,0x18(%rcx)
ffffffff81c496a2: 74 34 je ffffffff81c496d8 <ipv6_gro_complete+0xb8> <--- 3rd conditional check: !((*opps)->flags & INET6_PROTO_GSO_EXTHDR)
ffffffff81c496a4: 48 98 cltq
ffffffff81c496a6: 48 01 c2 add %rax,%rdx
ffffffff81c496a9: 0f b6 42 01 movzbl 0x1(%rdx),%eax
ffffffff81c496ad: 0f b6 0a movzbl (%rdx),%ecx
ffffffff81c496b0: 8d 04 c5 08 00 00 00 lea 0x8(,%rax,8),%eax
ffffffff81c496b7: 01 c6 add %eax,%esi
ffffffff81c496b9: 85 c9 test %ecx,%ecx <--- for loop starts here
ffffffff81c496bb: 74 e7 je ffffffff81c496a4 <ipv6_gro_complete+0x84> <--- 1st conditional check: proto != NEXTHDR_HOP
ffffffff81c496bd: 48 8b 0c cf mov (%rdi,%rcx,8),%rcx
ffffffff81c496c1: 48 85 c9 test %rcx,%rcx
ffffffff81c496c4: 75 d8 jne ffffffff81c4969e <ipv6_gro_complete+0x7e> <--- 2nd conditional check: unlikely(!(*opps))

... (indirect call ops->callbacks.gro_complete)

ipv6_exthdrs_len contains a loop which has 3 conditional checks.
For the common (no extensions header) case, in the new code, *all 3
branches are completely avoided*

patched code ipv6_gro_complete:

// if (skb->encapsulation)
ffffffff81befe58: f6 83 81 00 00 00 20 testb $0x20,0x81(%rbx)
ffffffff81befe5f: 74 78 je ffffffff81befed9 <ipv6_gro_complete+0xb9>

...

// else
ffffffff81befed9: 0f b6 43 50 movzbl 0x50(%rbx),%eax
ffffffff81befedd: 0f b7 73 4c movzwl 0x4c(%rbx),%esi
ffffffff81befee1: 48 8b 0c c5 c0 3f a9 mov -0x7d56c040(,%rax,8),%rcx

... (indirect call ops->callbacks.gro_complete)

Thus, the patch is beneficial for both the common case and the ext hdr
case. I would appreciate a second consideration :)

> > while patch 1/2 could be useful, patch 2/2 overall looks not worthy to
> > me.
> >
> > I suggest to re-post for inclusion only patch 1, unless others have
> > strong different opinions.
> >
>
> +2
>
> I have the same feeling/opinion.
>
> > Cheers,
> >
> > Paolo
> >