RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

From: Tian, Kevin
Date: Fri Apr 23 2021 - 06:31:54 EST


> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Friday, April 23, 2021 7:40 AM
>
> On Thu, Apr 22, 2021 at 04:38:08PM -0600, Alex Williamson wrote:
>
> > Because it's fundamental to the isolation of the device? What you're
> > proposing doesn't get around the group issue, it just makes it implicit
> > rather than explicit in the uapi.
>
> I'm not even sure it makes it explicit or implicit, it just takes away
> the FD.
>
> There are four group IOCTLs, I see them mapping to /dev/ioasid follows:
> VFIO_GROUP_GET_STATUS -
> + VFIO_GROUP_FLAGS_CONTAINER_SET is fairly redundant
> + VFIO_GROUP_FLAGS_VIABLE could be in a new sysfs under
> kernel/iomm_groups, or could be an IOCTL on /dev/ioasid
> IOASID_ALL_DEVICES_VIABLE
>
> VFIO_GROUP_SET_CONTAINER -
> + This happens implicitly when the device joins the IOASID
> so it gets moved to the vfio_device FD:
> ioctl(vifo_device_fd, JOIN_IOASID_FD, ioasifd)
>
> VFIO_GROUP_UNSET_CONTAINER -
> + Also moved to the vfio_device FD, opposite of JOIN_IOASID_FD
>
> VFIO_GROUP_GET_DEVICE_FD -
> + Replaced by opening /dev/vfio/deviceX
> Learn the deviceX which will be the cdev sysfs shows as:
> /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/vfio/deviceX/dev
> Open /dev/vfio/deviceX
>
> > > How do we model the VFIO group security concept to something like
> > > VDPA?
> >
> > Is it really a "VFIO group security concept"? We're reflecting the
> > reality of the hardware, not all devices are fully isolated.
>
> Well, exactly.
>
> /dev/ioasid should understand the group concept somehow, otherwise it
> is incomplete and maybe even security broken.
>
> So, how do I add groups to, say, VDPA in a way that makes sense? The
> only answer I come to is broadly what I outlined here - make
> /dev/ioasid do all the group operations, and do them when we enjoin
> the VDPA device to the ioasid.
>
> Once I have solved all the groups problems with the non-VFIO users,
> then where does that leave VFIO? Why does VFIO need a group FD if
> everyone else doesn't?
>
> > IOMMU group. This is the reality that any userspace driver needs to
> > play in, it doesn't magically go away because we drop the group file
> > descriptor.
>
> I'm not saying it does, I'm saying it makes the uAPI more regular and
> easier to fit into /dev/ioasid without the group FD.
>
> > It only makes the uapi more difficult to use correctly because
> > userspace drivers need to go outside of the uapi to have any idea
> > that this restriction exists.
>
> I don't think it makes any substantive difference one way or the
> other.
>
> With the group FD: the userspace has to read sysfs, find the list of
> devices in the group, open the group fd, create device FDs for each
> device using the name from sysfs.
>
> Starting from a BDF the general pseudo code is
> group_path = readlink("/sys/bus/pci/devices/BDF/iommu_group")
> group_name = basename(group_path)
> group_fd = open("/dev/vfio/"+group_name)
> device_fd = ioctl(VFIO_GROUP_GET_DEVICE_FD, BDF);
>
> Without the group FD: the userspace has to read sysfs, find the list
> of devices in the group and then open the device-specific cdev (found
> via sysfs) and link them to a /dev/ioasid FD.
>
> Starting from a BDF the general pseudo code is:
> device_name = first_directory_of("/sys/bus/pci/devices/BDF/vfio/")
> device_fd = open("/dev/vfio/"+device_name)
> ioasidfd = open("/dev/ioasid")
> ioctl(device_fd, JOIN_IOASID_FD, ioasidfd)
>
> These two routes can have identical outcomes and identical security
> checks.
>
> In both cases if userspace wants a list of BDFs in the same group as
> the BDF it is interested in:
> readdir("/sys/bus/pci/devices/BDF/iommu_group/devices")
>
> It seems like a very small difference to me.
>
> I still don't see how the group restriction gets surfaced to the
> application through the group FD. The applications I looked through
> just treat the group FD as a step on their way to get the device_fd.
>

So your proposal sort of moves the entire container/group/domain
managment into /dev/ioasid and then leaves vfio only provide device
specific uAPI. An ioasid represents a page table (address space), thus
is equivalent to the scope of VFIO container. Having the device join
an ioasid is equivalent to attaching a device to VFIO container, and
here the group integrity must be enforced. Then /dev/ioasid anyway
needs to manage group objects and their association with ioasid and
underlying iommu domain thus it's pointless to keep same logic within
VFIO. Is this understanding correct?

btw one remaining open is whether you expect /dev/ioasid to be
associated with a single iommu domain, or multiple. If only a single
domain is allowed, the ioasid_fd is equivalent to the scope of VFIO
container. It is supposed to have only one gpa_ioasid_id since one
iommu domain can only have a single 2nd level pgtable. Then all other
ioasids, once allocated, must be nested on this gpa_ioasid_id to fit
in the same domain. if a legacy vIOMMU is exposed (which disallows
nesting), the userspace has to open an ioasid_fd for every group.
This is basically the VFIO way. On the other hand if multiple domains
is allowed, there could be multiple ioasid_ids each holding a 2nd level
pgtable and an iommu domain (or a list of pgtables and domains due to
incompatibility issue as discussed in another thread), and can be
nested by other ioasids respectively. The application only needs
to open /dev/ioasid once regardless of whether vIOMMU allows
nesting, and has a single interface for ioasid allocation. Which way
do you prefer to?

Thanks
Kevin