Re: RFC: zero copy recv()

From: Eric Dumazet
Date: Thu Apr 25 2019 - 13:50:09 EST




On 4/25/19 1:01 AM, Maxim Uvarov wrote:
> On Wed, 24 Apr 2019 at 18:59, Eric Dumazet <eric.dumazet@xxxxxxxxx> wrote:
>>
>>
>>
>> On 04/23/2019 11:23 PM, Maxim Uvarov wrote:
>>> Hello,
>>>
>>> At various conferences I see people trying to accelerate networking
>>> by moving packet processing, including the protocol level, completely
>>> to user space. It might be DPDK, ODP or AF_XDP plus some network
>>> stack on top of it. People then try to test this solution with
>>> existing applications, preferably without modifying the application
>>> binaries, by using LD_PRELOAD to intercept the socket syscalls (recv(),
>>> sendto(), etc.). The current recv() expects the application to allocate
>>> memory, and the call "copies" the packet into that memory. A copy per
>>> packet is slow. Can we consider implementing zero-copy-friendly API
>>> calls? Could such a change be accepted into the kernel?
>>
>
> Hello Eric, thanks for responding.
>
>> Generic zero copy is hard.
>>
>
> yes that is true.
>
>> As soon as you have multiple consumers in different domains for the data,
>> you need some kind of multiplexing, typically using hardware capabilities.
>>
>> For TCP, we implemented zero copy last year, which works quite well
>> on x86 if your network uses MTU of 4096+headers.
>>
>> tools/testing/selftests/net/tcp_mmap.c reaches line rate (100Gbit) on
>> a single TCP flow, if using a NIC able to perform header split.
>>
>
> That is great work. But aren't there context switches on
> getsockopt(TCP_ZEROCOPY_RECEIVE) and read() per packet?

No, since in many cases you actually know how many bytes are expected to be received.

SO_RCVLOWAT can be used by the application to tell the kernel:

- Please send me an EPOLLIN only when you have at least XXXXXX bytes available in receive queue.
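
A minimal sketch of that idea, assuming a Linux socket fd and an existing
epoll instance. The helper names set_rcvlowat(), get_rcvlowat() and
watch_readable() are hypothetical and are not taken from tcp_mmap.c; only
setsockopt()/getsockopt() with SO_RCVLOWAT and epoll_ctl() are real calls.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/socket.h>

/* Ask the kernel to report the socket as readable only once at least
 * min_bytes are queued. Returns 0 on success, -1 on error (errno set).
 * Hypothetical helper name, for illustration only. */
int set_rcvlowat(int fd, int min_bytes)
{
    assert(min_bytes > 0);
    return setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT,
                      &min_bytes, sizeof(min_bytes));
}

/* Read back the low-water mark currently in effect, or -1 on error. */
int get_rcvlowat(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &val, &len) < 0)
        return -1;
    return val;
}

/* Register fd for EPOLLIN on an existing epoll instance. With the
 * low-water mark set on a TCP socket, EPOLLIN is normally raised per
 * batch of bytes rather than per packet. */
int watch_readable(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };

    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}
```

An application expecting, say, 1 MB records would call
set_rcvlowat(fd, 1 << 20) once and then sleep in epoll_wait() until that
much data is queued. For TCP the kernel may cap the requested value based
on the receive buffer limits, so reading it back with get_rcvlowat() is a
reasonable sanity check.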

>
> I played with AF_XDP, where one core can be isolated to poll the
> umem pool memory while another core does softirq processing.
> Polling the umem is really fast: about 96 ns on a 2.5 GHz x86 laptop,
> with no context switches on the umem polling core.

Sure, but again this is very far from being 'generic', let's say if you want to reuse the TCP stack...

>
> But in general, if the getsockopt()+read() pair in tcp_mmap.c were
> replaced by a single zero-copy call, something like recvmsg_zc(), then
> it could be LD_PRELOADed.
> The mmap() could also be moved into socket creation to simplify the API.
> Does that look reasonable?

Honestly I prefer not having to play games like that.

There are many subtle issues there, really.

>
>> But the model is not to run a legacy application with some LD_PRELOAD
>> hack/magic, sorry.
>>
> It is more likely that legacy applications will want to use zero-copy
> networking. Once the API is stable they will support it, especially
> if the API can be used with minimal changes to the apps.
> Then it will be quite easy to use an LD_PRELOAD hack or to change the
> application to use some other IP stack.
>
> Maxim.
>