Re: [PATCH net-next v4 00/27] io_uring zerocopy send

From: Pavel Begunkov
Date: Wed Jul 20 2022 - 09:33:00 EST


On 7/18/22 03:19, David Ahern wrote:
On 7/14/22 12:55 PM, Pavel Begunkov wrote:
You dropped comments about TCP testing; any progress there? If not, can
you relay any issues you are hitting?

Not really a problem, but for me it's bottlenecked at NIC bandwidth
(~3GB/s) for both zc and non-zc and doesn't come anywhere near
saturating a CPU. It was actually benchmarked by my colleague quite a
while ago, but I can't find the numbers. I probably need to at least add
localhost numbers or grab a better server.

Testing localhost TCP with a hack (see below). It doesn't include the
refcounting optimisations I was testing UDP with; those will be sent
afterwards. Numbers are in MB/s:

IO size | non-zc    | zc
1200    | 4174      | 4148
4096    | 7597      | 11228

I am surprised by the low numbers; you should be able to saturate a 100G
link with TCP and ZC TX API.

It was a quick test on my laptop: not a super fast CPU, a preemptible
kernel, etc. Considering that it also processes receives within the
same send syscall, which roughly doubles the overhead, 87Gb/s looks ok.
It's not like MSG_ZEROCOPY would look much different. On top of that,
all sends here are executed sequentially in io_uring, so there is no
extra parallelism. As for 1200, I think 4GB/s is reasonable; it's just
that the kernel overhead per byte is too high, and it should be the
same with plain send(2).
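To make the arithmetic behind those figures explicit, here is a small sketch that converts the table's MB/s values to Gb/s and to an approximate number of sends per second. The helper names are made up for illustration. The 1024 divisor is an assumption chosen because it reproduces the ~87Gb/s figure, and treating 1 MB as 10^6 bytes for the request rate is likewise an assumption, since the benchmark's exact unit isn't stated.

```python
# Figures copied from the localhost TCP table above (MB/s).
RESULTS_MB_S = {
    1200: {"non-zc": 4174, "zc": 4148},
    4096: {"non-zc": 7597, "zc": 11228},
}


def mb_s_to_gbit_s(mb_s, mb_per_gb=1024):
    """Convert MB/s to Gb/s. mb_per_gb=1024 is an assumption that
    matches the ~87Gb/s quoted for the 11228 MB/s result."""
    return mb_s / mb_per_gb * 8


def sends_per_second(mb_s, io_size, bytes_per_mb=10**6):
    """Approximate request rate, assuming every send carries io_size
    bytes and 1 MB = bytes_per_mb bytes (also an assumption)."""
    return mb_s * bytes_per_mb / io_size


for io_size, row in RESULTS_MB_S.items():
    for mode, mb_s in row.items():
        print(
            f"{io_size:>5} B {mode:>6}: "
            f"{mb_s_to_gbit_s(mb_s):6.1f} Gb/s, "
            f"{sends_per_second(mb_s, io_size) / 1e6:5.2f} M sends/s"
        )
```

Under these assumptions the 1200-byte runs issue more sends per second than the 4096-byte zc run while moving far fewer bytes, which is consistent with per-request overhead, not copying, being the limit for small sends.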

?
It's a stream socket so those sends are coalesced into MTU sized packets.

That leaves syscall and io_uring overhead, locking the socket, etc.,
which still requires more cycles than just copying 1200 bytes. And
the CPU used is not blazingly fast; it could be that a better CPU/setup
will saturate 100G.

Because it's localhost, we also spend cycles here on the recv side.
Using a real NIC with 1200 bytes, zc is ~5-10% worse than non-zc; maybe
the omitted optimisations will help somewhat. I don't consider it to be
a blocker, but it would be interesting to poke into later. One thing
helping non-zc is that it squeezes a number of requests into a single
page, whereas zerocopy adds a new frag for every request.

Can't say anything new for larger payloads: I'm still NIC-bound, but
looking at CPU utilisation, zc doesn't burn as many cycles as non-zc.
Also, I don't remember if it was mentioned before, but another catch is
that with TCP it expects users not to flush notifications too often,
because each flush forces it to allocate a new skb and lose a good chunk
of the benefits from using TCP.

I had issues with TCP sockets and io_uring at the end of 2020:
https://www.spinics.net/lists/io-uring/msg05125.html

have not tried anything recent (from 2022).

Haven't seen it back then. In general io_uring doesn't stop submitting
requests if one request fails, at least because we're trying to execute
requests asynchronously. And in general, requests can get executed
out of order, so submitting a bunch of requests to a single TCP sock
without any ordering on the io_uring side is most likely a bug.

TCP socket buffer fills resulting in a partial send (i.e., for a given
sqe submission only part of the write/send succeeded). io_uring was not
handling that case.

Shouldn't have been different from send(2) with MSG_DONTWAIT: it can be
short and the user should handle it. Also, I believe Jens just recently
pushed in-kernel retries on the io_uring side for TCP in such cases.
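For illustration, a minimal sketch of what "the user should handle it" can look like for a plain non-blocking send (MSG_DONTWAIT on Linux): keep resending the unsent tail until everything has gone out. This uses ordinary sockets over a local socketpair, not io_uring, and the names send_all, wait_writable and demo are hypothetical helpers, not part of any API discussed here.

```python
import socket


def send_all(sock, payload, wait_writable):
    """Send the whole payload with non-blocking sends, resuming after
    every short send. wait_writable() is called when the socket buffer
    is full; a real program would poll/epoll for writability there."""
    view = memoryview(payload)
    sent_total = 0
    short_sends = 0
    while sent_total < len(view):
        try:
            n = sock.send(view[sent_total:], socket.MSG_DONTWAIT)
        except BlockingIOError:
            wait_writable()
            continue
        if n < len(view) - sent_total:
            short_sends += 1  # partial send: only a prefix was queued
        sent_total += n
    return sent_total, short_sends


def demo(size=1 << 20):
    """Push `size` bytes through a socketpair in a single thread."""
    tx, rx = socket.socketpair()
    received = bytearray()

    def drain():
        # Stand-in for the remote peer: read whatever is queued.
        while True:
            try:
                chunk = rx.recv(65536, socket.MSG_DONTWAIT)
            except BlockingIOError:
                return
            if not chunk:
                return
            received.extend(chunk)

    payload = bytes(range(256)) * (size // 256)
    total, short = send_all(tx, payload, drain)
    drain()  # collect whatever is still queued on the receive side
    tx.close()
    rx.close()
    return total, short, bytes(received) == payload


if __name__ == "__main__":
    print(demo())
```

How many short sends occur depends on the socket buffer sizes, so the sketch only counts them and does not assume a particular number.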

I'll try to find some time to resurrect the iperf3 patch and try top of
tree kernel.

Awesome


You can link io_uring requests, i.e. IOSQE_IO_LINK, guaranteeing
execution ordering. And if you meant links in the message, I agree
that it was not the best decision to treat len < sqe->len as a
non-error that doesn't break links, but it was later added that
MSG_WAITALL also changes the success condition to len == sqe->len.
But all that is relevant only if you were using linking.
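As a toy model of the link semantics described above (plain Python, not kernel code and not the liburing API), the sketch below walks a chain of linked sends. A short send leaves the chain intact unless the waitall flag is set, in which case anything shorter than the requested length counts as a failure. The function name and the (requested, outcome) tuple layout are made up for illustration. That the remaining requests in a broken chain complete with -ECANCELED is how io_uring reports cancelled links.

```python
import errno


def run_link_chain(requests, waitall=False):
    """requests: list of (requested_len, outcome), where outcome is the
    number of bytes transferred or a negative errno. Returns the
    per-request results in chain order."""
    results = []
    broken = False
    for requested, outcome in requests:
        if broken:
            # Everything after a failed link is cancelled.
            results.append(-errno.ECANCELED)
            continue
        results.append(outcome)
        # Without waitall only a real error breaks the chain; with it,
        # a short transfer (outcome < requested) breaks it as well.
        if outcome < 0 or (waitall and outcome < requested):
            broken = True
    return results


chain = [(4096, 1000), (4096, 4096)]
print(run_link_chain(chain))                # short send, chain continues
print(run_link_chain(chain, waitall=True))  # short send breaks the chain
```

In the first call the second request still runs after the short send, which is exactly the case where bytes can end up missing from the stream if the application doesn't check the first result.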

--
Pavel Begunkov