Re: vbus design points: shm and shm-signals

From: Anthony Liguori
Date: Mon Aug 24 2009 - 19:57:35 EST


Gregory Haskins wrote:
> Hi Anthony,
>
>> Fundamentally, how is this different than the virtio->add_buf concept?
>
> From my POV, they are at different levels. Calling vbus->shm() is for
> establishing a shared-memory region, including routing the memory and
> signal-path contexts. You do this once at device init time, and then
> run some algorithm on top (such as a virtqueue design).

virtio explicitly avoids having a single setup-memory-region call because it was designed to accommodate things like Xen grant tables, where you have a fixed number of sharable buffers that need to be set up and torn down as you use them.

You can certainly use add_buf() to set up a persistent mapping, but it's not the common usage. For KVM, since all memory is accessible by the host without special setup, add_buf() never results in an exit (it's essentially a nop).

So I think from that perspective, add_buf() is a functional superset of vbus->shm().
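
To make that concrete, here is a minimal sketch of the persistent-mapping-via-posting pattern (the descriptor layout and shm_post() are invented for illustration; this is not the actual virtio API):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor: one guest-physical extent plus a cookie
 * that is handed back on completion. */
struct shm_desc {
	uint64_t gpa;     /* guest-physical address of the region */
	uint32_t len;     /* length in bytes */
	void    *cookie;  /* lets the guest match completions */
};

/* Stand-in for an add_buf()-style posting primitive. */
extern int shm_post(struct shm_desc *desc);

/* Init-time: post one long-lived region and never reclaim it.
 * This expresses the vbus->shm() usage pattern with the same
 * primitive that normally posts transient buffers. */
int register_persistent_region(void *region, size_t len)
{
	static struct shm_desc desc;

	desc.gpa    = (uint64_t)(uintptr_t)region; /* 1:1 mapping assumed */
	desc.len    = (uint32_t)len;
	desc.cookie = NULL;  /* never completed, so never recycled */

	return shm_post(&desc);
}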

> virtio->add_buf() OTOH, is a run-time function. You do this to modify
> the shared-memory region that is already established at init time by
> something like vbus->shm(). You would do this to queue a network
> packet, for instance.
>
> That said, shm-signal's closest analogy in virtio would be vq->kick(),
> vq->callback(), vq->enable_cb(), and vq->disable_cb(). The difference
> is that the notification mechanism isn't associated with a particular
> type of shared-memory construct (such as a virtqueue), but instead can
> be used with any shared-mem algorithm (at least, if I designed it
> properly).

Obviously, virtio allows multiple ring implementations based on how it does its layering. The key point is that it doesn't expose that to the consumer of the device.
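
To illustrate the layering (an approximation, not the exact kernel structs): the driver-facing queue hides the ring behind an ops table, so a different ring implementation can slot in without the device driver noticing. The shm-signal-like operations (kick, callback, enable_cb, disable_cb) live in the same table, alongside the buffer operations:

#include <stdbool.h>

struct virtqueue;
struct scatterlist;

/* Sketch of a queue ops table; the device driver codes against this
 * and never sees the ring layout itself. */
struct virtqueue_ops {
	int   (*add_buf)(struct virtqueue *vq, struct scatterlist *sg,
			 unsigned int out, unsigned int in, void *data);
	void  (*kick)(struct virtqueue *vq);         /* guest -> host */
	void *(*get_buf)(struct virtqueue *vq, unsigned int *len);
	void  (*disable_cb)(struct virtqueue *vq);   /* mask host -> guest */
	bool  (*enable_cb)(struct virtqueue *vq);    /* unmask, detect races */
};

struct virtqueue {
	struct virtqueue_ops *vq_ops;            /* ring-specific impl */
	void (*callback)(struct virtqueue *vq);  /* host -> guest notify */
	void *priv;                              /* e.g. split-ring state */
};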

Do you see a compelling reason to have an interface at this layer?

>> virtio provides a mechanism to register scatter/gather lists, associate
>> a handle with them, and provides a mechanism for retrieving notification
>> that the buffer has been processed.
>
> Yes, and I agree this is very useful for many/most algorithms...but not
> all. Sometimes you don't want ring-like semantics, but instead want
> something like an idempotent table. (Think of things like interrupt
> controllers, timers, etc.)

We haven't crossed this bridge yet because we haven't implemented one of these devices. One approach would be to use add_buf() to register fixed shared memory regions. Because our rings are fixed in size, this implies a fixed number of shared memory mappings.

You could also extend virtio to provide a mechanism to register unlimited numbers of shared memory regions. The problem with this is that it doesn't work well for hypervisors with a fixed number of shared-memory mappings (like Xen).

> However, sometimes you may want to say "time is now X", and later "time
> is now Y". The update 'X' is technically superseded by Y and is stale.
> But a ring may allow both to exist in-flight within the shm
> simultaneously if the recipient (guest or host) is lagging, and the X
> may be processed even though its data is now irrelevant. What we really
> want is the transform of X->Y to invalidate anything else in flight so
> that only Y is visible.

We actually do this today but we just don't use virtio. I'm not sure we need a single bus that can serve both of these purposes. What does this abstraction buy us?
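
To be concrete about the pattern we use today, here is a minimal sketch of such an idempotent slot, along the lines of our paravirtual clock (the field names are invented, not the actual pvclock ABI): the writer brackets each update with a version bump, and the reader retries if it observes an in-progress or torn update, so only the latest value is ever consumed.

#include <stdint.h>

/* Hypothetical shared slot: host writes, guest reads. */
struct shared_time {
	volatile uint32_t version;  /* odd while an update is in flight */
	volatile uint64_t now_ns;   /* "time is now X" */
};

/* Writer: each update supersedes the previous one in place, so a
 * lagging reader can never consume a stale X after Y is published. */
static void publish_time(struct shared_time *s, uint64_t now_ns)
{
	s->version++;            /* odd: update in progress */
	__sync_synchronize();
	s->now_ns = now_ns;
	__sync_synchronize();
	s->version++;            /* even: update visible */
}

/* Reader: retry until a stable, latest value is observed. */
static uint64_t read_time(const struct shared_time *s)
{
	uint32_t v;
	uint64_t t;

	do {
		v = s->version;
		__sync_synchronize();
		t = s->now_ns;
		__sync_synchronize();
	} while ((v & 1) || v != s->version);

	return t;
}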

> If you think about it, a ring is a superset of this construct...the ring
> meta-data is the "shared-table" (e.g. HEAD ptr, TAIL ptr, COUNT, etc).
> So we start by introducing the basic shm concept, and allow the next
> layer (virtio/virtqueue) in the stack to refine it for its needs.

I think there's a trade-off between practicality and theoretical abstraction. Certainly, a system can be constructed from just notification and shared-memory primitives; this is what Xen does via event channels and grant tables. In practice, this ends up being cumbersome and results in complex drivers. Compare netfront to virtio-net, for instance.

We choose to abstract at the ring level precisely because it simplifies driver implementations. I think we've been very successful here.

Today, virtio does not accommodate devices that don't fit into a ring model very well. There's certainly room to discuss how to do this. But if there is to be a layer below virtio's ring semantics, I don't think vbus is it, because vbus mandates much higher levels of the stack (namely, device enumeration).

IOW, I can envision a model that looks like PCI -> virtio-pci -> virtio-shm -> virtio-ring -> virtio-net

where a generic virtio-shm mechanism provides a non-ring interface for non-ring devices. That doesn't preclude non-virtio-pci transports; it just suggests how we would do the layering.
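
Roughly, the virtio-shm layer I'm imagining would expose something like the following (entirely hypothetical; the names are invented for illustration and no such API exists today). virtio-ring would consume map_region() and signal() to build its rings, while a non-ring device such as a paravirtual timer could use a region directly:

struct virtio_device;

/* Hypothetical virtio-shm contract: shared regions plus signaling,
 * with no ring semantics implied. */
struct virtio_shm_ops {
	/* Establish a shared region once, at device init time. */
	int  (*map_region)(struct virtio_device *vdev, unsigned int id,
			   void *addr, unsigned long len);
	/* Tear it down at device shutdown. */
	void (*unmap_region)(struct virtio_device *vdev, unsigned int id);
	/* Guest -> host notification for a given region. */
	void (*signal)(struct virtio_device *vdev, unsigned int id);
	/* Host -> guest notification callback for a given region. */
	void (*set_callback)(struct virtio_device *vdev, unsigned int id,
			     void (*cb)(struct virtio_device *, unsigned int));
};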

So maybe there's a future for vbus as virtio-shm? How attached are you to your device discovery infrastructure?

If you introduced a virtio-shm layer to the virtio API that looked a bit like vbus' device API, and then decoupled the device discovery bits into a virtio-vbus transport, I think you'd end up with something that was quite agreeable.

As a transport, PCI has significant limitations, the biggest being the maximum number of devices we can support. Its biggest advantage, though, is portability, so it's something I think we would always want to support. However, having a virtio transport optimized for Linux guests is something I would certainly support.

>> vbus provides a mechanism to register a single buffer with an integer
>> handle, priority, and a signaling mechanism.
>
> Again, I think we are talking about two different layers. You would
> never put entries of different priority into a virtio-ring. This
> doesn't make sense, as they would just get linearized by the fifo.
>
> What you *would* do is possibly make multiple virtqueues, each with a
> different priority (for instance, say 8 rx queues for virtio-net).

I think priority is an overloaded concept. I'm not sure it belongs in a generic memory sharing API.

>> What does one do with priority, btw?
>
> There are, of course, many answers to that question. One particularly
> trivial example is 802.1p networking. So, for instance, you can
> classify and prioritize network traffic so that things like
> control/timing packets are higher priority than best-effort HTTP.

Wouldn't you do this at a config-space level, though? I agree you would want to have multiple rings with individual priorities, but I think priority is a per-ring configuration, just as programmable triplet filtering would be a per-ring configuration. I also think how priority gets interpreted really depends on the device, so it belongs in the device's ABI instead of the shared memory or ring ABI.
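
For example (an invented config layout, not existing ABI, purely to show where priority would live): a multiqueue net device could describe each ring's priority in its config space, leaving the ring and shm layers priority-free:

#include <stdint.h>

/* Invented per-queue attribute block in device config space. */
struct netdev_queue_cfg {
	uint16_t queue_index;  /* which virtqueue this describes */
	uint8_t  priority;     /* e.g. 802.1p class for this ring */
	uint8_t  reserved;
};

struct netdev_config {
	uint8_t  mac[6];
	uint16_t num_queues;
	struct netdev_queue_cfg queues[8];  /* e.g. 8 rx queues */
};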

> HTH,

It does, thanks.

Regards,

Anthony Liguori