Re: [PATCH v2 1/2] vfio/mdev: add version attribute for mdev device

From: Yan Zhao
Date: Wed May 08 2019 - 23:17:49 EST


On Thu, May 09, 2019 at 05:22:42AM +0800, Alex Williamson wrote:
> On Wed, 8 May 2019 07:27:40 -0400
> Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote:
>
> > On Wed, May 08, 2019 at 05:18:26AM +0800, Alex Williamson wrote:
> > > On Sun, 5 May 2019 21:49:04 -0400
> > > Yan Zhao <yan.y.zhao@xxxxxxxxx> wrote:
> > >
> > > > version attribute is used to check two mdev devices' compatibility.
> > > >
> > > > The key point of this version attribute is that it's rw.
> > > > User space has no need to understand internal of device version and no
> > > > need to compare versions by itself.
> > > > Compared to reading version strings from both two mdev devices being
> > > > checked, user space only reads from one mdev device's version attribute.
> > > > After getting its version string, user space writes this string into the
> > > > other mdev device's version attribute. Vendor driver of mdev device
> > > > whose version attribute being written will check device compatibility of
> > > > the two mdev devices for user space and return success for compatibility
> > > > or errno for incompatibility.
> > > > So two readings of version attributes + checking in user space are now
> > > > changed to one reading + one writing of version attributes + checking in
> > > > vendor driver.
> > > > Format and length of version strings are now private to vendor driver
> > > > who can define them freely.
> > > >
> > > > __ user space
> > > > /\ \
> > > > / \write
> > > > / read \
> > > > ______/__ ___\|/___
> > > > | version | | version |-->check compatibility
> > > > ----------- -----------
> > > > mdev device A mdev device B
> > > >
> > > > This version attribute is optional. If a mdev device does not provide
> > > > with a version attribute, this mdev device is incompatible to all other
> > > > mdev devices.
> > > >
> > > > Live migration is able to take advantage of this version attribute.
> > > > Before user space actually starts live migration, it can first check
> > > > whether two mdev devices are compatible.
> > > >
> > > > v2:
> > > > 1. added detailed intent and usage
> > > > 2. made definition of version string completely private to vendor driver
> > > > (Alex Williamson)
> > > > 3. abandoned changes to sample mdev drivers (Alex Williamson)
> > > > 4. mandatory --> optional (Cornelia Huck)
> > > > 5. added description for errno (Cornelia Huck)
> > > >
> > > > Cc: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > > > Cc: Erik Skultety <eskultet@xxxxxxxxxx>
> > > > Cc: "Dr. David Alan Gilbert" <dgilbert@xxxxxxxxxx>
> > > > Cc: Cornelia Huck <cohuck@xxxxxxxxxx>
> > > > Cc: "Tian, Kevin" <kevin.tian@xxxxxxxxx>
> > > > Cc: Zhenyu Wang <zhenyuw@xxxxxxxxxxxxxxx>
> > > > Cc: "Wang, Zhi A" <zhi.a.wang@xxxxxxxxx>
> > > > Cc: Neo Jia <cjia@xxxxxxxxxx>
> > > > Cc: Kirti Wankhede <kwankhede@xxxxxxxxxx>
> > > > Cc: Daniel P. Berrangé <berrange@xxxxxxxxxx>
> > > > Cc: Christophe de Dinechin <dinechin@xxxxxxxxxx>
> > > >
> > > > Signed-off-by: Yan Zhao <yan.y.zhao@xxxxxxxxx>
> > > > ---
> > > > Documentation/vfio-mediated-device.txt | 140 +++++++++++++++++++++++++
> > > > 1 file changed, 140 insertions(+)
> > > >
> > > > diff --git a/Documentation/vfio-mediated-device.txt b/Documentation/vfio-mediated-device.txt
> > > > index c3f69bcaf96e..013a764968eb 100644
> > > > --- a/Documentation/vfio-mediated-device.txt
> > > > +++ b/Documentation/vfio-mediated-device.txt
> > > > @@ -202,6 +202,7 @@ Directories and files under the sysfs for Each Physical Device
> > > > | | |--- available_instances
> > > > | | |--- device_api
> > > > | | |--- description
> > > > + | | |--- version
> > > > | | |--- [devices]
> > > > | |--- [<type-id>]
> > > > | | |--- create
> > > > @@ -209,6 +210,7 @@ Directories and files under the sysfs for Each Physical Device
> > > > | | |--- available_instances
> > > > | | |--- device_api
> > > > | | |--- description
> > > > + | | |--- version
> > > > | | |--- [devices]
> > > > | |--- [<type-id>]
> > > > | |--- create
> > > > @@ -216,6 +218,7 @@ Directories and files under the sysfs for Each Physical Device
> > > > | |--- available_instances
> > > > | |--- device_api
> > > > | |--- description
> > > > + | |--- version
> > > > | |--- [devices]
> > >
> > > I thought there was a request to make this more specific to migration
> > > by renaming it to something like migration_version. Also, as an
> > >
> > so this attribute may not only include a mdev device's parent device info and
> > mdev type, but also include numeric software version of vendor specific
> > migration code, right?
>
> It's a vendor defined string, it should be considered opaque to the
> user, the vendor can include whatever they feel is relevant.
>
> > This actually makes sense.
> > So, do I need to add a disclaimer in this doc like:
> > vendor driver should be responsible by itself for a mdev device's migration
> > compatibility.
>
> I thought that was the purpose of this attribute.
>
> > During migration setup phase, general migration code in user space VFIO only
> > checks this version of VFIO migration region, and will not check software version
> > of vendor specific migration code.
>
> What is "software version of vendor specific migration code"? If
> you're asking whether anything will check for parent device
> compatibility or parent vendor driver compatibility, the answer is no,
> that's what this interface is meant to provide. Userspace should need
> to do nothing more than verify the mdev types match and then use the
> version attribute to confirm source to target compatibility.
>
> > It is suggested to incorporate at least parent device info and software version
> > of vendor specific migration code into this migration_version attribute.
>
> We can make recommendations as "best practices", but ultimately it's an
> opaque string defined by the vendor driver.
>
> But you never addressed my comment that previous reviews asked for the
> attribute to be named something more specific to migration.
>
I aggree to rename it to migration_version.

> > > optional attribute, it seems the example should perhaps not add it to
> > > all types to illustrate that it is not required.
> > ok. got it.
> >
> >
> > > >
> > > > * [mdev_supported_types]
> > > > @@ -246,6 +249,143 @@ Directories and files under the sysfs for Each Physical Device
> > > > This attribute should show the number of devices of type <type-id> that can be
> > > > created.
> > > >
> > > > +* version
> > > > +
> > > > + This attribute is rw, and is optional.
> > > > + It is used to check device compatibility between two mdev devices and is
> > >
> > > between two mdev devices of the same type.
> > >
> > ok. got it.
> > But I have a question about aggregation proposed earlier.
> > Do we also have to assume the two mdev devices are of the same aggregation
> > count?
> > However, aggregation count is not available before a mdev device is created. :(
>
> We don't support aggregation yet, but yes, that's going to introduce
> issues here. Any configuration beyond the base mdev type would imply
> that the base type could be compatible for migration, but the specific
> instances might not. Resolving that would imply that our version
> information needs to be relative to an instance, not just the base type.
>
> How would we extend this interface to support that? We could have a
> version attribute on each device instance, which might report something
> like:
>
> 0123456789,aggregate=2
>
> IOW the device instance of version concatenates the mdev type version
> with the additional create parameters for that device. Writing this to
> the type attribute should be parsed by the vendor driver as support for
> given base device with specified additional create parameters.
>
> I'm afraid this also bring us around to treacherous questions around
> who is responsible for creating that device on the migration target and
> where is this meta information about the device exposed. Maybe instead
> of a per instance version attribute we would instead expose the create
> parameters as an attribute per instance and it would be userspace's
> responsibility to create a version string from the mdev type and create
> parameters similar to above. This would also make it possible to
> create a compatible instance on the target without pre-knowledge of how
> the device was created.
>
> Also, this issue exists already, but compatibility and capacity are two
> separate things, I think we want to limit this interface to the
> former. For instance, if I want to migrate an i915-GVTg_V5_1 device to
> another system where available_instances for that type is zero, the
> version attribute should strictly report the device compatibility, it's
> not responsible for returning an errno due to lack of resources.
> Similarly if we were to do something with aggregation, the version
> attribute would strictly report if the target supports creating that
> device with those parameters, not whether it has capacity to create
> such as device at that instant in time.
>
I think it's good to have a migration_version attribute under each device
instance.
It has two pros:
1. vendor driver can incorperate into the string things like:
parent device info + mdev type + aggregate count + software version
2. even for non mdev devices, like a VF in SRIOV
PF driver can export this migration_version under a VF instance. so a VF is
possible to migrate with vfio-pci driver installed. (though with current VFIO
live migration RFCs, with vfio-pci driver is not able to migrate. but it
provides a possibility)


could we maintain two migration_version attributes?
"create parameter + mdev_type migraton_version" for user space software to
create a migration compatible mdev device.
"per device instance migration_version" for verifying migration compatibility of
devices already created.


> > > > + accessed in pairs between the two mdev devices being checked.
> > >
> > > "in pairs"?
> > I meant, user space needs to access version attributes from two mdev device.
> > but seems that it's needless to mention that... I'll remove it :)
> >
> >
> > > > + The intent of this attribute is to make an mdev device's version opaque to
> > > > + user space, so instead of reading two mdev devices' version strings and
> > >
> > > perhaps "...instead of reading the version string of two mdev devices
> > > and comparing them in userspace..."
> > yes, better, thanks:)
> >
> > > > + comparing in userspace, user space should only read one mdev device's version
> > > > + attribute, and writes this version string into the other mdev device's version
> > > > + attribute. Then vendor driver of mdev device whose version attribute being
> > > > + written would check the incoming version string and tell user space whether
> > > > + the two mdev devices are compatible via return value. That's why this
> > > > + attribute is writable.
> > > > +
> > > > + when reading this attribute, it should show device version string of
> > > > + the device of type <type-id>.
> > > > +
> > > > + This string is private to vendor driver itself. Vendor driver is able to
> > > > + freely define format and length of device version string.
> > > > + e.g. It can use a combination of pciid of parent device + mdev type.
> > >
> > > Can the user assume the data contents of the string is ascii
> > > characters? It's good that the vendor driver defines the format and
> > > length, but the user probably needs some expectation bounding that
> > > length. Should we define it as no larger than PATH_MAX (4096), or maybe
> > > NAME_MAX (255) might be more reasonable?
> > I think so. I'll add those restrictions in next revision.
>
> If we start adding creation parameters, PATH_MAX may actually be the
> more reasonable limit.
>
got it.
> > > > +
> > > > + When writing a string to this attribute, vendor driver should analyze this
> > > > + string and check whether the mdev device being identified by this string is
> > > > + compatible with the mdev device for this attribute. vendor driver should then
> > >
> > > Compatible for what purpose? I think this is where specifically
> > > calling this a migration_version potentially has value.
> > yes. if it also covers version of vendor specific migration code, calling it
> > migration_version is more appropriate.
>
> I think we're discussing an interface that validates "I [the vendor
> driver] am able to import the state of this version", therefore it must
> include every relevant aspect of the vendor support for that.
>
> > > > + return written string's length if it regards the two mdev devices are
> > > > + compatible; vendor driver should return negative errno if it regards the two
> > > > + mdev devices are not compatible.
> > >
> > > IOW, the write(2) will succeed if the version is determined to be
> > > compatible and otherwise fail with vendor specific errno.
> > >
> > thanks:)
> >
> > > > +
> > > > + User space should treat ANY of below conditions as two mdev devices not
> > > > + compatible:
> > >
> > > (0) The mdev devices are not of the same type.
> > >
> > the same as above. do we also need to take aggregation count into consideration?
> >
> > > > + (1) any one of the two mdev devices does not have a version attribute
> > > > + (2) error when read from one mdev device's version attribute
> > >
> > > Is this intended to support that the vendor driver can supply a version
> > > attribute but not support migration? TBH, this sounds like a vendor
> > > driver bug, but maybe it's necessary if the vendor driver could have
> > > some types that support migration and others that do not? IOW, we're
> > > supplying the same attribute groups to all devices from a vendor, in
> > > which case my comment above regarding an example type without a version
> > > attribute might be invalid.
> > hmm, this is to make life easier for vendor driver to have some types that
> > support migration and others that do not. while we can get rid of returning
> > errno by providing different attribute groups to different devices, the way of
> > returning errno gives a simpler choice to vendors.
>
> Yes, I think it might be overly complicated to provide different
> attribute groups for different devices, we have more flexibility if the
> user does not make any assumptions based only on the presence of a
> version attribute.
>
> > > > + (3) error when write one mdev device's version string to the other mdev
> > > > + device's version attribute
> > > > +
> > > > + User space should regard two mdev devices compatible when ALL of below
> > > > + conditions are met:
> > >
> > > (0) The mdev devices are of the same type
> > >
> > > > + (1) success when read from one mdev device's version attribute.
> > > > + (2) success when write one mdev device's version string to the other mdev
> > > > + device's version attribute
> > > > +
> > > > + Errno:
> > > > + If vendor driver wants to claim a mdev device incompatible to all other mdev
> > > > + devices, it should not register version attribute for this mdev device. But if
> > > > + a vendor driver has already registered version attribute and it wants to claim
> > > > + a mdev device incompatible to all other mdev devices, it needs to return
> > > > + -ENODEV on access to this mdev device's version attribute.
> > > > + If a mdev device is only incompatible to certain mdev devices, write of
> > > > + incompatible mdev devices's version strings to its version attribute should
> > > > + return -EINVAL;
> > >
> > > I think it's best not to define the specific errno returned for a
> > > specific situation, let the vendor driver decide, userspace simply
> > > needs to know that an errno on read indicates the device does not
> > > support migration version comparison and that an errno on write
> > > indicates the devices are incompatible or the target doesn't support
> > > migration versions.
> > >
> > yes, user space only gets 0 or 1 as return code, not those errno.
> > maybe I only need to describe errno in patch 2/2.
> >
> > > > +
> > > > + This attribute can be taken advantage of by live migration.
> > > > + If user space detects two mdev devices are compatible through version
> > > > + attribute, it can start migration between the two mdev devices, otherwise it
> > > > + should abort its migration attempts between the two mdev devices.
> > > > +
> > > > + Example Usage:
> > > > + case 1:
> > > > + source side mdev device is of uuid 5ac1fb20-2bbf-4842-bb7e-36c58c3be9cd,
> > > > + its mdev type is i915-GVTg_V5_4. pci id of parent device is 8086-193b.
> > > > + target side mdev device is if of uuid 882cc4da-dede-11e7-9180-078a62063ab1,
> > > > + its mdev type is i915-GVTg_V5_4. pci id of parent device is 8086-193b.
> > > > +
> > > > + # readlink /sys/bus/pci/devices/0000\:00\:02.0/\
> > > > + 5ac1fb20-2bbf-4842-bb7e-36c58c3be9cd/mdev_type
> > > > + ../mdev_supported_types/i915-GVTg_V5_4
> > > > +
> > > > + # readlink /sys/bus/pci/devices/0000\:00\:02.0/\
> > > > + 882cc4da-dede-11e7-9180-078a62063ab1/mdev_type
> > > > + ../mdev_supported_types/i915-GVTg_V5_4
> > > > +
> > > > + (1) read source side mdev device's version.
> > > > + #cat \
> > > > + /sys/bus/pci/devices/0000\:00\:02.0/5ac1fb20-2bbf-4842-bb7e-36c58c3be9cd/\
> > > > + mdev_type/version
> > > > + 8086-193b-i915-GVTg_V5_4
> > >
> > > Is this really the version information exposed in 2/2? This is opaque,
> > > so of course you can add things later, but it seems short sighted not
> > > to even append a version 0 tag to account for software compatibility
> > > differences since the above only represents a parent and mdev type
> > > based version.
> > >
> > yes, currently in 2/2, the version only includes <vendor id> + <device id> +
> > <mdev type>. but you are right, it's better to include software migration
> > version number.
> > so vendor drivers have below 3 ways to designate a mdev device has no migration
> > capability.
> > 1. not registering migration_version attribute
> > 2. on reading migration_version, returning errno
> > 3. on reading migration_version, returning string indicating non-migratable.
> >
> > The reason of not giving up way 2 is that maybe it can accelerate user space
> > getting information of device incompatible. if we only keep way 3, it would not
> > know this info until writing this string to target attribute.
> >
> > do you agree?
>
> The string is opaque to the user, so if you're asking in (3) that the
> user read and parse some information in the string that indicates the
> device is non-migratable then no, I don't agree with that policy. If
> reading the version attribute produces a result, the only thing the
> user can do with that result is to test it by writing it to another
> version attribute. Thanks,
>
> Alex

ok. so we will keep way 2 as a valid choice :)

Thanks
Yan
> > > > + (2) write source side mdev device's version string into target side mdev
> > > > + device's version attribute.
> > > > + # echo 8086-193b-i915-GVTg_V5_4 >
> > > > + /sys/bus/pci/devices/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/\
> > > > + mdev_type/version
> > > > + # echo $?
> > > > + 0
> > >
> > > TBH, there's a lot of superfluous information in this example that can
> > > be stripped out. For example:
> > >
> > > "
> > > (1) Compare mdev types:
> > >
> > > The mdev type of an instantiated device can be read from the mdev_type
> > > link within the device instance in sysfs, for example:
> > >
> > > # basename $(readlink -f /sys/bus/mdev/devices/$MDEV_UUID/mdev_type/)
> > >
> > > The mdev types available on a given host system can also be found
> > > through /sys/class/mdev_bus, for example:
> > >
> > > # ls /sys/class/mdev_bus/*/mdev_supported_types/
> > >
> > > Migration is only possible between devices of the same mdev type.
> > >
> > > (2) Retrieve the mdev source version:
> > >
> > > The migration version information can either be read from the mdev_type
> > > link on an instantiated device:
> > >
> > > # cat /sys/bus/mdev/devices/$UUID1/mdev_type/version
> > >
> > > Or it can be read from the mdev type definition, for example:
> > >
> > > # cat /sys/class/mdev_bus/*/mdev_supported_types/$MDEV_TYPE/version
> > >
> > > If reading the source version generates an error, migration is not
> > > possible. NB, there might be several parent devices for a given mdev
> > > type on a host system, each may support or expose different versions.
> > > Matching the specific mdev type to a parent may become important in
> > > such configurations.
> > >
> > > (3) Test source version at target:
> > >
> > > Given a version as outlined above, its compatibility to an instantiated
> > > device of the same mdev type can be tested as:
> > >
> > > # echo $VERSION > /sys/bus/mdev/devices/$UUID2/mdev_type/version
> > >
> > > If this write fails, the source and target versions are not compatible
> > > or the target does not support migration.
> > >
> > > Compatibility can also be tested prior to target device creation using
> > > the mdev type definition for a parent device with a previously found
> > > matching mdev type, for example:
> > >
> > > # echo $VERSION > /sys/class/mdev_bus/$PARENT/mdev_supported_types/$MDEV_TYPE/version
> > >
> > > Again, an error writing the version indicates that an instance of this
> > > mdev type would not support a migration from the provided version.
> > > "
> > >
> > > In particular from the provided example, the specific UUIDs, mdev
> > > types, parent information, and contents of the version attribute do not
> > > contribute to illustrating the protocol. In fact, displaying the
> > > contents of the version attribute may tempt users to do comparison on
> > > their own, especially given how easy it is to decide the GVT-g version
> > > string.
> > >
> > got it!
> > great thanks!
> > I'll update it to the next revision.
> > >
> > > > +
> > > > + in this case, user space's write to target side mdev device's version
> > > > + attribute returns success to indicate the two mdev devices are compatible.
> > > > +
> > > > + case 2:
> > > > + source side mdev device is of uuid 5ac1fb20-2bbf-4842-bb7e-36c58c3be9cd,
> > > > + its mdev type is i915-GVTg_V5_4. pci id of parent device is 8086-193b.
> > > > + target side mdev device is if of uuid 882cc4da-dede-11e7-9180-078a62063ab1,
> > > > + its mdev type is i915-GVTg_V5_4. pci id of parent device is 8086-191b.
> > > > +
> > > > + # readlink /sys/bus/pci/devices/0000\:00\:02.0/\
> > > > + 5ac1fb20-2bbf-4842-bb7e-36c58c3be9cd/mdev_type
> > > > + ../mdev_supported_types/i915-GVTg_V5_4
> > > > +
> > > > + # readlink /sys/bus/pci/devices/0000\:00\:02.0/\
> > > > + 882cc4da-dede-11e7-9180-078a62063ab1/mdev_type
> > > > + ../mdev_supported_types/i915-GVTg_V5_4
> > > > +
> > > > + (1) read source side mdev device's version.
> > > > + #cat \
> > > > + /sys/bus/pci/devices/0000\:00\:02.0/5ac1fb20-2bbf-4842-bb7e-36c58c3be9cd/\
> > > > + mdev_type/version
> > > > + 8086-193b-i915-GVTg_V5_4
> > > > +
> > > > + (2) write source side mdev device's version string into target side mdev
> > > > + device's version attribute.
> > > > + # echo 8086-193b-i915-GVTg_V5_4 >
> > > > + /sys/bus/pci/devices/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/\
> > > > + mdev_type/version
> > > > + -bash: echo: write error: Invalid argument
> > > > +
> > > > + in this case, user space's write to target side mdev device's version
> > > > + attribute returns error to indicate the two mdev devices are incompatible.
> > > > + (incompatible because pci ids of the two mdev devices' parent devices are
> > > > + different).
> > > > +
> > > > + case 3:
> > > > + source side mdev device is of uuid 5ac1fb20-2bbf-4842-bb7e-36c58c3be9cd,
> > > > + its mdev type is i915-GVTg_V5_4. pci id of parent device is 8086-193b.
> > > > + But vendor driver does not provide version attribute for this device.
> > > > +
> > > > + (1) read source side mdev device's version.
> > > > + #cat \
> > > > + /sys/bus/pci/devices/0000\:00\:02.0/5ac1fb20-2bbf-4842-bb7e-36c58c3be9cd/\
> > > > + mdev_type/version
> > > > + cat: '/sys/bus/pci/devices/0000:00:02.0/5ac1fb20-2bbf-4842-bb7e-36c58c3be9cd/\
> > > > + mdev_type/version': No such file or directory
> > > > +
> > > > + in this case, user space reads source side mdev device's version attribute
> > > > + which does not exist however. user space regards the two mdev devices as not
> > > > + compatible and will not start migration between the two mdev devices.
> > > > +
> > > > +
> > >
> > > This is far too long for description and examples, it's not this
> > > complicated. Thanks,
> > >
> > got it. I'll follow your above example :)
> >
> > thanks
> > Yan
> > > > * [device]
> > > >
> > > > This directory contains links to the devices of type <type-id> that have been
> > >
>