Re: [PATCH net-next RFC 5/5] vhost_net: basic tx virtqueue batched processing

From: Michael S. Tsirkin
Date: Wed Sep 27 2017 - 18:19:47 EST


On Wed, Sep 27, 2017 at 10:04:18AM +0800, Jason Wang wrote:
>
>
> On 2017-09-27 03:25, Michael S. Tsirkin wrote:
> > On Fri, Sep 22, 2017 at 04:02:35PM +0800, Jason Wang wrote:
> > > This patch implements basic batched processing of tx virtqueue by
> > > prefetching desc indices and updating used ring in a batch. For
> > > non-zerocopy case, vq->heads were used for storing the prefetched
> > > indices and updating used ring. It is also a requirement for doing
> > > more batching on top. For the zerocopy case, and for simplicity, batched
> > > processing is simply disabled by fetching and processing only one
> > > descriptor at a time; this could be optimized in the future.
> > >
> > > XDP_DROP (without touching skb) on tun (with MoonGen in guest) with
> > > zerocopy disabled:
> > >
> > > Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz:
> > > Before: 3.20Mpps
> > > After: 3.90Mpps (+22%)
> > >
> > > No differences were seen with zerocopy enabled.
> > >
> > > Signed-off-by: Jason Wang <jasowang@xxxxxxxxxx>
> > So where is the speedup coming from? I'd guess the ring is
> > hot in cache, it's faster to access it in one go, then
> > pass many packets to net stack. Is that right?
> >
> > Another possibility is better code cache locality.
>
> Yes, I think the speedup comes from:
>
> - fewer cache misses
> - fewer cache line bounces when the virtqueue is about to be full (the guest
> is faster than the host, which is the case with MoonGen)
> - fewer memory barriers
> - possibly faster copy speed by using copy_to_user() on modern CPUs
>
> >
> > So how about this patchset is refactored:
> >
> > 1. use existing APIs: just first get the packets, then
> > transmit them all, then mark them all used
>
> It looks like the current API cannot get packets first; it only supports
> getting packets one by one (if you mean vhost_get_vq_desc()). And updating
> the used ring may incur more misses in this case.

Right. So if you do

	for (...)
		vhost_get_vq_desc

then later

	for (...)
		vhost_add_used

then you get most of the benefits, except maybe those from fewer
code cache misses and from copy_to_user.
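
As a rough user-space illustration of that two-loop shape (not the vhost code itself): struct toy_vq, toy_get_desc(), toy_add_used(), toy_tx_batch(), RING_SIZE and BATCH below are all made-up stand-ins. The real vhost_get_vq_desc() and vhost_add_used() take more arguments and operate on guest memory, so treat this only as a sketch of the control flow.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 8	/* hypothetical ring size */
#define BATCH     4	/* hypothetical per-pass batch limit */

/* Toy stand-in for a virtqueue; NOT the kernel's struct vhost_virtqueue. */
struct toy_vq {
	unsigned avail[RING_SIZE];	/* descriptor heads published by the guest */
	unsigned avail_idx;		/* guest's producer index */
	unsigned last_avail;		/* next avail entry the host will read */
	unsigned used[RING_SIZE];	/* heads handed back to the guest */
	unsigned used_idx;		/* host's producer index */
};

/* Toy analogue of vhost_get_vq_desc(): returns one head, or -1 if empty. */
static int toy_get_desc(struct toy_vq *vq)
{
	if (vq->last_avail == vq->avail_idx)
		return -1;
	return (int)vq->avail[vq->last_avail++ % RING_SIZE];
}

/* Toy analogue of vhost_add_used(): hands one head back to the guest. */
static void toy_add_used(struct toy_vq *vq, unsigned head)
{
	vq->used[vq->used_idx++ % RING_SIZE] = head;
}

/* One tx pass in two phases; returns the number of descriptors handled. */
static size_t toy_tx_batch(struct toy_vq *vq)
{
	unsigned heads[BATCH];
	size_t n = 0, i;
	int head;

	/* Phase 1: fetch up to BATCH heads while the avail ring is hot. */
	while (n < BATCH && (head = toy_get_desc(vq)) >= 0)
		heads[n++] = (unsigned)head;

	assert(n <= BATCH);

	/* (The packets for heads[0..n-1] would be transmitted here.) */

	/* Phase 2: return all fetched heads to the used ring together. */
	for (i = 0; i < n; i++)
		toy_add_used(vq, heads[i]);

	return n;
}
```

With six descriptors queued, the first call to toy_tx_batch() handles BATCH of them and the second call handles the remaining two; both loops only ever use the one-at-a-time helpers.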

> > 2. add new APIs and move the loop into vhost core
> > for more speedups
>
> I don't see any advantage; it looks like we would just need some callbacks,
> for example, in this case.
>
> Thanks

IIUC callbacks pretty much destroy the code cache locality advantages:
the instruction pointer jumps around too much.
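
To make the shape being discussed concrete, here is a minimal sketch of a core-owned loop that calls back into the driver once per descriptor. toy_desc_fn, toy_core_for_each() and toy_sum_heads() are hypothetical names invented for this example; nothing like them is claimed to exist in vhost.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-descriptor hook; not an existing vhost interface. */
typedef void (*toy_desc_fn)(unsigned head, void *opaque);

/* Core-owned loop: makes one indirect call per descriptor head. */
static size_t toy_core_for_each(const unsigned *heads, size_t n,
				toy_desc_fn fn, void *opaque)
{
	size_t i;

	assert(fn != NULL);
	for (i = 0; i < n; i++)
		fn(heads[i], opaque);	/* control leaves the loop body here */
	return n;
}

/* Example hook: adds each head to the counter passed via opaque. */
static void toy_sum_heads(unsigned head, void *opaque)
{
	*(unsigned *)opaque += head;
}
```

Each iteration leaves the loop for whatever code the hook points at and then returns, which is the jumping around meant above.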


--
MST