Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

From: Jason Wang
Date: Wed Jul 01 2020 - 10:10:37 EST



On 2020/7/1 9:04 PM, Eugenio Perez Martin wrote:
On Wed, Jul 1, 2020 at 2:40 PM Jason Wang <jasowang@xxxxxxxxxx> wrote:

On 2020/7/1 6:43 PM, Eugenio Perez Martin wrote:
On Tue, Jun 23, 2020 at 6:15 PM Eugenio Perez Martin
<eperezma@xxxxxxxxxx> wrote:
On Mon, Jun 22, 2020 at 6:29 PM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
On Mon, Jun 22, 2020 at 06:11:21PM +0200, Eugenio Perez Martin wrote:
On Mon, Jun 22, 2020 at 5:55 PM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
On Fri, Jun 19, 2020 at 08:07:57PM +0200, Eugenio Perez Martin wrote:
On Mon, Jun 15, 2020 at 2:28 PM Eugenio Perez Martin
<eperezma@xxxxxxxxxx> wrote:
On Thu, Jun 11, 2020 at 5:22 PM Konrad Rzeszutek Wilk
<konrad.wilk@xxxxxxxxxx> wrote:
On Thu, Jun 11, 2020 at 07:34:19AM -0400, Michael S. Tsirkin wrote:
As testing shows no performance change, switch to that now.
What kind of testing? 100GiB? Low latency?

Hi Konrad.

I tested this version of the patch:
https://lkml.org/lkml/2019/10/13/42

It was tested for throughput with DPDK's testpmd (as described in
http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html)
and kernel pktgen. No latency tests were performed by me. Maybe it is
interesting to perform a latency test or just a different set of tests
over a recent version.

Thanks!
I have repeated the tests with v9, and results are a little bit different:
* If I test opening it with testpmd, I see no change between versions
OK that is testpmd on guest, right? And vhost-net on the host?

Hi Michael.

No, sorry, as described in
http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html.
But I could add to test it in the guest too.

These kinds of raw packets "bursts" do not show performance
differences, but I could test deeper if you think it would be worth
it.
Oh ok, so this is without guest, with virtio-user.
It might be worth checking dpdk within guest too just
as another data point.

Ok, I will do it!

* If I forward packets between two vhost-net interfaces in the guest
using a linux bridge in the host:
And here I guess you mean virtio-net in the guest kernel?
Yes, sorry: Two virtio-net interfaces connected with a linux bridge in
the host. More precisely:
* Adding one of the interfaces to another namespace, assigning it an
IP, and starting netserver there.
* Assign another IP in the range manually to the other virtual net
interface, and start the desired test there.

If you think it would be better to perform them differently, please let me know.
Not sure why you bother with namespaces since you said you are
using L2 bridging. I guess it's unimportant.

Sorry, I think I should have provided more context about that.

The only reason to use namespaces is to force the traffic of these
netperf tests to go through the external bridge, so that netperf
exercises different paths than testpmd (or pktgen, or other "blast
frames unconditionally" tests).

This way, I make sure that the same version of everything runs in the
guest, and it is a little bit easier to manage CPU affinity, start and
stop testing...

I could use a different VM for sending and receiving, but I find this
way a faster one and it should not introduce a lot of noise. I can
test with two VM if you think that this use of network namespace
introduces too much noise.

Thanks!

- netperf UDP_STREAM shows a performance increase of 1.8x, almost
doubling performance. This gets lower as frame size increases.
Regarding UDP_STREAM:
* with event_idx=on: The performance difference is reduced a lot if
affinity is applied properly (manually assigning CPUs on host/guest and
setting IRQs on guest), making them perform equally with and without
the patch again. Maybe the batching makes the scheduler perform
better.

Note that for UDP_STREAM, the result is pretty tricky to analyze. E.g.
setting a sndbuf for TAP may help performance (by reducing drops).

Ok, will add that to the test. Thanks!


Actually, it's better to skip the UDP_STREAM test since:

- My understanding is that very few applications use raw UDP streams
- It's hard to analyze (usually you need to count the drop ratio etc.)



- the rest of the tests go noticeably worse: UDP_RR goes from ~6347
transactions/sec to 5830
* Regarding UDP_RR, TCP_STREAM, and TCP_RR, proper CPU pinning makes
them perform similarly again, only a very small performance drop
observed. It could be just noise.
** All of them perform better than vanilla if event_idx=off, not sure
why. I can try to repeat them if you suspect that can be a test
failure.

* With testpmd and event_idx=off, if I send from the VM to host, I see
a performance increase, especially with small packets. The buf api also
increases performance compared with only batching: Sending the minimum
packet size in testpmd makes pps go from 356kpps to 473kpps.

What's your setup for this? The number looks rather low. I'd expect
1-2 Mpps at least.

Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, 2 NUMA nodes of 16G memory
each, and no device assigned to the NUMA node I'm testing in. Too low
for testpmd AF_PACKET driver too?


I haven't tested AF_PACKET; I guess it should use V3, which is the mmap-based zerocopy interface.

It might also be worth checking the CPU utilization of the vhost thread. It needs to be stressed to 100%; otherwise there could be a bottleneck somewhere else.
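
For reference, one way to check this is to sample the thread's CPU time from /proc twice and compare. Below is a minimal Python sketch, not a definitive tool: the helper names (parse_cpu_ticks, cpu_percent) are hypothetical, and it assumes the standard Linux /proc/<pid>/stat layout, where utime and stime are fields 14 and 15, counted in clock ticks.

```python
def parse_cpu_ticks(stat_line):
    """Return (comm, utime + stime) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the last ')'.
    head, _, tail = stat_line.rpartition(")")
    comm = head.split("(", 1)[1]
    fields = tail.split()
    # fields[0] is the state (field 3), so utime/stime (fields 14/15)
    # sit at indexes 11 and 12.
    utime, stime = int(fields[11]), int(fields[12])
    return comm, utime + stime


def cpu_percent(ticks_before, ticks_after, interval_s, hz):
    """CPU usage in percent of one core over the sampling interval."""
    return 100.0 * (ticks_after - ticks_before) / (hz * interval_s)
```

To use it, read /proc/<pid>/stat of the vhost thread twice, interval_s seconds apart, and pass os.sysconf("SC_CLK_TCK") as hz. A result well below 100 would suggest the vhost thread is not the bottleneck.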



Sending
1024 length UDP-PDU makes it go from 570kpps to 64 kpps.

Something strange I observe in these tests: I get more pps the bigger
the transmitted buffer size is. Not sure why.

** Sending from the host to the VM does not make a big change with the
patches in small packets scenario (minimum, 64 bytes, about 645
without the patch, ~625 with batch and batch+buf api). If the packets
are bigger, I can see a performance increase: with 256 bits,

I think you meant bytes?

Yes, sorry.

it goes
from 590kpps to about 600kpps, and in case of 1500 bytes payload it
gets from 348kpps to 528kpps, so it is clearly an improvement.
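
To put the testpmd figures quoted above on a common scale, here is a small Python sketch that computes the relative change for each before/after pair. The labels are my own shorthand for the scenarios, and the numbers are copied as reported, in kpps; nothing here is a new measurement.

```python
def relative_change(before, after):
    """Percentage change going from `before` to `after`."""
    return 100.0 * (after - before) / before


# (before, after) packet rates in kpps, as reported in this thread.
reported = {
    "VM->host, minimum packet size": (356, 473),
    "host->VM, 256 bytes": (590, 600),
    "host->VM, 1500 bytes": (348, 528),
}

for name, (before, after) in reported.items():
    print(f"{name}: {relative_change(before, after):+.1f}%")
```

This gives roughly +32.9%, +1.7% and +51.7% respectively, so the 256-byte host->VM case is close to flat while the other two are clear gains.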

* with testpmd and event_idx=on, batching+buf api perform similarly in
both directions.

All of testpmd tests were performed with no linux bridge, just a
host's tap interface (<interface type='ethernet'> in xml),

What DPDK driver did you use in the test (AF_PACKET?).

Yes, both testpmd are using AF_PACKET driver.


I see. Using AF_PACKET means extra layers of issues need to be analyzed, which is probably not good.



with a
testpmd txonly and another in rxonly forward mode, and using the
receiving side packets/bytes data. Guest's rps, xps and interrupts,
and host's vhost threads affinity were also tuned in each test to
schedule both testpmd and vhost in different processors.

My feeling is that it would be easier to start from a simple setup,
e.g. start without a VM.

1) TX: testpmd(txonly) -> virtio-user -> vhost_net -> XDP_DROP on TAP
2) RX: pktgen -> TAP -> vhost_net -> testpmd(rxonly)

Got it. Is there a reason to prefer pktgen over testpmd?


I think the reason is that with testpmd you must go through a userspace-to-kernel interface (AF_PACKET), which cannot be as fast as pktgen, since pktgen:

- talks directly to the xmit of TAP
- can clone skbs

Thanks



Thanks


I will send the v10 RFC with the small changes requested by Stefan and Jason.

Thanks!

OK so it seems plausible that we still have a bug where an interrupt
is delayed. That is the main difference between pmd and virtio.
Let's try disabling event index, and see what happens - that's
the trickiest part of interrupts.

Got it, will get back with the results.

Thank you very much!

- TCP_STREAM goes from ~10.7 Gbps to ~7 Gbps
- TCP_RR from 6223.64 transactions/sec to 5739.44