Re: [PATCH net-next] tun/macvtap: limit the packets queued throughrcvbuf

From: Michael S. Tsirkin
Date: Thu Jan 16 2014 - 00:49:40 EST

On Thu, Jan 16, 2014 at 12:29:35PM +0800, Jason Wang wrote:
> On 01/15/2014 03:21 PM, Michael S. Tsirkin wrote:
> >On Wed, Jan 15, 2014 at 11:36:01AM +0800, Jason Wang wrote:
> >>On 01/14/2014 05:52 PM, Michael S. Tsirkin wrote:
> >>>On Tue, Jan 14, 2014 at 04:45:24PM +0800, Jason Wang wrote:
> >>>>>On 01/14/2014 04:25 PM, Michael S. Tsirkin wrote:
> >>>>>>>On Tue, Jan 14, 2014 at 02:53:07PM +0800, Jason Wang wrote:
> >>>>>>>>>We used to limit the number of packets queued through tx_queue_length. This
> >>>>>>>>>has several issues:
> >>>>>>>>>
> >>>>>>>>>- tx_queue_length is the control of qdisc queue length, simply reusing it
> >>>>>>>>> to control the packets queued by device may cause confusion.
> >>>>>>>>>- After commit 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 ("macvtap: Add
> >>>>>>>>> support of packet capture on macvtap device."), an unexpected qdisc
> >>>>>>>>> caused by non-zero tx_queue_length will lead qdisc lock contention for
> >>>>>>>>> multiqueue deivce.
> >>>>>>>>>- What we really want is to limit the total amount of memory occupied not
> >>>>>>>>> the number of packets.
> >>>>>>>>>
> >>>>>>>>>So this patch tries to solve the above issues by using socket rcvbuf to
> >>>>>>>>>limit the packets could be queued for tun/macvtap. This was done by using
> >>>>>>>>>sock_queue_rcv_skb() instead of a direct call to skb_queue_tail(). Also two
> >>>>>>>>>new ioctl() were introduced for userspace to change the rcvbuf like what we
> >>>>>>>>>have done for sndbuf.
> >>>>>>>>>
> >>>>>>>>>With this fix, we can safely change the tx_queue_len of macvtap to
> >>>>>>>>>zero. This will make multiqueue works without extra lock contention.
> >>>>>>>>>
> >>>>>>>>>Cc: Vlad Yasevich<vyasevic@xxxxxxxxxx>
> >>>>>>>>>Cc: Michael S. Tsirkin<mst@xxxxxxxxxx>
> >>>>>>>>>Cc: John Fastabend<john.r.fastabend@xxxxxxxxx>
> >>>>>>>>>Cc: Stephen Hemminger<stephen@xxxxxxxxxxxxxxxxxx>
> >>>>>>>>>Cc: Herbert Xu<herbert@xxxxxxxxxxxxxxxxxxx>
> >>>>>>>>>Signed-off-by: Jason Wang<jasowang@xxxxxxxxxx>
> >>>>>>>No, I don't think we can change userspace-visible behaviour like that.
> >>>>>>>
> >>>>>>>This will break any existing user that tries to control
> >>>>>>>queue length through sysfs,netlink or device ioctl.
> >>>>>But it looks like a buggy API, since tx_queue_len should be for qdisc
> >>>>>queue length instead of device itself.
> >>>Probably, but it's been like this since 2.6.x time.
> >>>Also, qdisc queue is unused for tun so it seemed kind of
> >>>reasonable to override tx_queue_len.
> >>>
> >>>>>If we really want to preserve the
> >>>>>behaviour, how about using a new feature flag and change the behaviour
> >>>>>only when the device is created (TUNSETIFF) with the new flag?
> >>>OK this addresses the issue partially, but there's also an issue
> >>>of permissions: tx_queue_len can only be changed if
> >>>capable(CAP_NET_ADMIN). OTOH in your patch a regular user
> >>>can change the amount of memory consumed per queue
> >>>by calling TUNSETRCVBUF.
> >>Yes, but we have the same issue for TUNSETSNDBUF.
> >To an extent, but TUNSETSNDBUF is different. It limits how much device can queue
> >*in the networking stack* but each queue in the stack is also
> >limited, when we exceed that we star dropping packets.
> >So while with infinite value (which is the default btw)
> >you can keep host pretty busy, you will not be able to run
> >it out of memory.
> >
> >The proposed TUNSETRCVBUF would keep configured amount
> >of memory around indefinitely so you can run host out of memory.
> >
> >So assuming all this
> >How about an ethtool or netlink command to configure this
> >instead?
> >
> Ok, so we can add net admin check for before trying to set rcvbuf.

No, in practice I think using ioctl for sndbuf was also a mistake.
Applications have no idea what to set it to - you need to know what else
is running on the system, after a while
QEMU ended up setting it back to infinity otherwise things kept

ethtool or netlink would not have this problem.
Which of the two is preferable I'm not sure.
I wonder what do management tools such as libvirt prefer.

> I
> think it's better to use ioctl since we've already use it for
> sndbuf. Using ethool means you need a dedicated new ethtool method
> just for tuntap which seems sub-optimal.
> Netlink looks better, but
> we should also implement other ioctl also.

I'm not sure what this last phrase means. Can you clarify pls?

> >>>>>>>Take a look at my patch in msg ID 20140109071721.GD19559@xxxxxxxxxx
> >>>>>>>which gives one way to set tx_queue_len to zero without
> >>>>>>>breaking userspace.
> >>>>>If I read the patch correctly, it will make no way for the user who
> >>>>>really want to change the qdisc queue length for tun.
> >>>Why would this matter? As far as I can see qdisc queue is currently unused.
> >>>
> >>User may use qdisc to do port mirroring, bandwidth limitation, traffic
> >>prioritization or more for a VM. So we do have users and maybe more
> >>consider the case of vpn.
> >Well it's not used by default at least.
> >I remember that we discussed this previously actually.
> >
> >If all we want to do actually is utilize no_qdisc by default,
> >we can simply use Eric's patch:
> >
> >
> >
> >and a similar patch for macvtap.
> >I tried it at the time and it didn't seem to help performance
> >at all, but a lot has changed since, in particular I didn't
> >test mq.
> >
> >If you now have results showing how it's beneficial, pls post them.
> >
> I will have a test to see the difference.
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at