Re: Re: Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE

From: Stefan Hajnoczi
Date: Thu Jul 01 2021 - 09:16:02 EST


On Thu, Jul 01, 2021 at 06:00:48PM +0800, Yongji Xie wrote:
> On Wed, Jun 30, 2021 at 6:06 PM Stefan Hajnoczi <stefanha@xxxxxxxxxx> wrote:
> >
> > On Tue, Jun 29, 2021 at 01:43:11PM +0800, Yongji Xie wrote:
> > > On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi <stefanha@xxxxxxxxxx> wrote:
> > > > On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> > > > > + static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len)
> > > > > + {
> > > > > + int fd;
> > > > > + void *addr;
> > > > > + size_t size;
> > > > > + struct vduse_iotlb_entry entry;
> > > > > +
> > > > > + entry.start = iova;
> > > > > + entry.last = iova + 1;
> > > >
> > > > Why +1?
> > > >
> > > > I expected the request to include *len so that VDUSE can create a bounce
> > > > buffer for the full iova range, if necessary.
> > > >
> > >
> > > The function is used to translate iova to va. And the *len is not
> > > specified by the caller. Instead, it's used to tell the caller the
> > > length of the contiguous iova region from the specified iova. And the
> > > ioctl VDUSE_IOTLB_GET_FD will get the file descriptor to the first
> > > overlapped iova region. So using iova + 1 should be enough here.
> >
> > Does the entry.last field have any purpose with VDUSE_IOTLB_GET_FD? I
> > wonder why userspace needs to assign a value at all if it's always +1.
> >
>
> If we need to get some iova regions in the specified range, we need
> the entry.last field. For example, we can use [0, ULONG_MAX] to get
> the first overlapped iova region which might be [4096, 8192]. But in
> this function, we don't use VDUSE_IOTLB_GET_FD like this. We need to
> get the iova region including the specified iova.

I see, thanks for explaining!

> > > > > + return addr + iova - entry.start;
> > > > > + }
> > > > > +
> > > > > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features
> > > >
> > > > Are these VIRTIO feature bits? Please explain how feature negotiation
> > > > works. There must be a way for userspace to report the device's
> > > > supported feature bits to the kernel.
> > > >
> > >
> > > Yes, these are VIRTIO feature bits. Userspace will specify the
> > > device's supported feature bits when creating a new VDUSE device with
> > > ioctl(VDUSE_CREATE_DEV).
> >
> > Can the VDUSE device influence feature bit negotiation? For example, if
> > the VDUSE virtio-blk device does not implement discard/write-zeroes, how
> > does QEMU or the guest find out about this?
> >
>
> There is a "features" field in struct vduse_dev_config which is used
> to do feature negotiation.

This approach is more restrictive than required by the VIRTIO
specification:

"The device SHOULD accept any valid subset of features the driver
accepts, otherwise it MUST fail to set the FEATURES_OK device status
bit when the driver writes it."

https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-130002

The spec allows a device to reject certain subsets of features. For
example, if feature B depends on feature A and can only be enabled when
feature A is also enabled.

From your description I think VDUSE would accept feature B without
feature A since the device implementation has no opportunity to fail
negotiation with custom logic.

Ideally VDUSE would send a SET_FEATURES message to userspace, allowing
the device implementation full flexibility in which subsets of features
to accept.

This is a corner case. Many or maybe even all existing VIRTIO devices
don't need this flexibility, but I want to point out this limitation in
the VDUSE interface because it may cause issues in the future.

> > > > > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a config interrupt
> > > >
> > > > Does this mean the contents of the configuration space are cached by
> > > > VDUSE?
> > >
> > > Yes, but the kernel will also store the same contents.
> > >
> > > > The downside is that the userspace code cannot generate the
> > > > contents on demand. Most devices doin't need to generate the contents
> > > > on demand, so I think this is okay but I had expected a different
> > > > interface:
> > > >
> > > > kernel->userspace VDUSE_DEV_GET_CONFIG
> > > > userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> > > >
> > >
> > > The problem is how to handle the failure of VDUSE_DEV_GET_CONFIG. We
> > > will need lots of modification of virtio codes to support that. So to
> > > make it simple, we choose this way:
> > >
> > > userspace -> kernel VDUSE_DEV_SET_CONFIG
> > > userspace -> kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> > >
> > > > I think you can leave it the way it is, but I wanted to mention this in
> > > > case someone thinks it's important to support generating the contents of
> > > > the configuration space on demand.
> > > >
> > >
> > > Sorry, I didn't get you here. Can't VDUSE_DEV_SET_CONFIG and
> > > VDUSE_DEV_INJECT_CONFIG_IRQ achieve that?
> >
> > If the contents of the configuration space change continuously, then the
> > VDUSE_DEV_SET_CONFIG approach is inefficient and might have race
> > conditions. For example, imagine a device where the driver can read a
> > timer from the configuration space. I think the VIRTIO device model
> > allows that although I'm not aware of any devices that do something like
> > it today. The problem is that VDUSE_DEV_SET_CONFIG would have to be
> > called frequently to keep the timer value updated even though the guest
> > driver probably isn't accessing it.
> >
>
> OK, I get you now. Since the VIRTIO specification says "Device
> configuration space is generally used for rarely-changing or
> initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> ioctl should not be called frequently.

The spec uses MUST and other terms to define the precise requirements.
Here the language (especially the word "generally") is weaker and means
there may be exceptions.

Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
approach is reads that have side-effects. For example, imagine a field
containing an error code if the device encounters a problem unrelated to
a specific virtqueue request. Reading from this field resets the error
code to 0, saving the driver an extra configuration space write access
and possibly race conditions. It isn't possible to implement those
semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
makes me think that the interface does not allow full VIRTIO semantics.

> > What's worse is that there might be race conditions where other
> > driver->device operations are supposed to update the configuration space
> > but VDUSE_DEV_SET_CONFIG means that the VDUSE kernel code is caching an
> > outdated copy.
> >
>
> I'm not sure. Should the device and driver be able to access the same
> fields concurrently?

Yes. The VIRTIO spec has a generation count to handle multi-field
accesses so that consistency can be ensured:
https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-180004

>
> > Again, I don't think it's a problem for existing devices in the VIRTIO
> > specification. But I'm not 100% sure and future devices might require
> > what I've described, so the VDUSE_DEV_SET_CONFIG interface could become
> > a problem.
> >
>
> If so, maybe a new interface can be added at that time. The
> VDUSE_DEV_GET_CONFIG might be better, but I still did not find a good
> way for failure handling.

I'm not aware of the details of why the current approach was necessary,
so I don't have any concrete suggestions. Sorry!

Stefan

Attachment: signature.asc
Description: PGP signature