Re: [PATCH v2 bpf-next 0/2] xsk: introduce generic almost-zerocopy xmit

From: Magnus Karlsson
Date: Mon Apr 12 2021 - 10:13:26 EST


On Wed, Mar 31, 2021 at 2:27 PM Alexander Lobakin <alobakin@xxxxx> wrote:
>
> This series is based on the exceptional generic zerocopy xmit logics
> initially introduced by Xuan Zhuo. It extends it the way that it
> could cover all the sane drivers, not only the ones that are capable
> of xmitting skbs with no linear space.
>
> The first patch is a random while-we-are-here improvement over
> full-copy path, and the second is the main course. See the individual
> commit messages for the details.
>
> The original (full-zerocopy) path is still here and still generally
> faster, but for now it seems like virtio_net will remain the only
> user of it, at least for a considerable period of time.
>
> From v1 [0]:
> - don't add a whole SMP_CACHE_BYTES because of only two bytes
> (NET_IP_ALIGN);
> - switch to zerocopy if the frame is 129 bytes or longer, not 128.
> 128 still fit to kmalloc-512, while a zerocopy skb is always
> kmalloc-1024 -> can potentially be slower on this frame size.
>
> [0] https://lore.kernel.org/netdev/20210330231528.546284-1-alobakin@xxxxx
>
> Alexander Lobakin (2):
> xsk: speed-up generic full-copy xmit

I took both your patches for a spin on my machine and for the first
one I do see a small but consistent drop in performance. I thought it
would go the other way, but it does not so let us put this one on the
shelf for now.

> xsk: introduce generic almost-zerocopy xmit

This one wreaked havoc on my machine ;-). The performance dropped with
75% for packets larger than 128 bytes when the new scheme kicks in.
Checking with perf top, it seems that we spend much more time
executing the sendmsg syscall. Analyzing some more:

$ sudo bpftrace -e 'kprobe:__sys_sendto { @calls = @calls + 1; }
interval:s:1 {printf("calls/sec: %d\n", @calls); @calls = 0;}'
Attaching 2 probes...
calls/sec: 1539509 with your patch compared to

calls/sec: 105796 without your patch

The application spends a lot of more time trying to get the kernel to
send new packets, but the kernel replies with "have not completed the
outstanding ones, so come back later" = EAGAIN. Seems like the
transmission takes longer when the skbs have fragments, but I have not
examined this any further. Did you get a speed-up?

> net/xdp/xsk.c | 32 ++++++++++++++++++++++----------
> 1 file changed, 22 insertions(+), 10 deletions(-)
>
> --
> Well, this is untested. I currently don't have an access to my setup
> and is bound by moving to another country, but as I don't know for
> sure at the moment when I'll get back to work on the kernel next time,
> I found it worthy to publish this now -- if any further changes will
> be required when I already will be out-of-sight, maybe someone could
> carry on to make a another revision and so on (I'm still here for any
> questions, comments, reviews and improvements till the end of this
> week).
> But this *should* work with all the sane drivers. If a particular
> one won't handle this, it's likely ill. Any tests are highly
> appreciated. Thanks!
> --
> 2.31.1
>
>