Re: [RFC] /dev/ioasid uAPI proposal

From: Jason Gunthorpe
Date: Fri May 28 2021 - 19:36:58 EST


On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:

> 2.1. /dev/ioasid uAPI
> +++++++++++++++++
>
> /*
> * Check whether an uAPI extension is supported.
> *
> * This is for FD-level capabilities, such as locked page pre-registration.
> * IOASID-level capabilities are reported through IOASID_GET_INFO.
> *
> * Return: 0 if not supported, 1 if supported.
> */
> #define IOASID_CHECK_EXTENSION _IO(IOASID_TYPE, IOASID_BASE + 0)


> /*
> * Register user space memory where DMA is allowed.
> *
> * It pins user pages and does the locked memory accounting so sub-
> * sequent IOASID_MAP/UNMAP_DMA calls get faster.
> *
> * When this ioctl is not used, one user page might be accounted
> * multiple times when it is mapped by multiple IOASIDs which are
> * not nested together.
> *
> * Input parameters:
> * - vaddr;
> * - size;
> *
> * Return: 0 on success, -errno on failure.
> */
> #define IOASID_REGISTER_MEMORY _IO(IOASID_TYPE, IOASID_BASE + 1)
> #define IOASID_UNREGISTER_MEMORY _IO(IOASID_TYPE, IOASID_BASE + 2)

So VA ranges are pinned and stored in a tree and later references to
those VA ranges by any other IOASID use the pin cached in the tree?

It seems reasonable and is similar to the ioasid parent/child I
suggested for PPC.

IMHO this should be merged with the all SW IOASID that is required for
today's mdev drivers. If this can be done while keeping this uAPI then
great, otherwise I don't think it is so bad to weakly nest a physical
IOASID under a SW one just to optimize page pinning.

Either way this seems like a smart direction

> /*
> * Allocate an IOASID.
> *
> * IOASID is the FD-local software handle representing an I/O address
> * space. Each IOASID is associated with a single I/O page table. User
> * must call this ioctl to get an IOASID for every I/O address space that is
> * intended to be enabled in the IOMMU.
> *
> * A newly-created IOASID doesn't accept any command before it is
> * attached to a device. Once attached, an empty I/O page table is
> * bound with the IOMMU then the user could use either DMA mapping
> * or pgtable binding commands to manage this I/O page table.

Can the IOASID can be populated before being attached?

> * Device attachment is initiated through device driver uAPI (e.g. VFIO)
> *
> * Return: allocated ioasid on success, -errno on failure.
> */
> #define IOASID_ALLOC _IO(IOASID_TYPE, IOASID_BASE + 3)
> #define IOASID_FREE _IO(IOASID_TYPE, IOASID_BASE + 4)

I assume alloc will include quite a big structure to satisfy the
various vendor needs?

> /*
> * Get information about an I/O address space
> *
> * Supported capabilities:
> * - VFIO type1 map/unmap;
> * - pgtable/pasid_table binding
> * - hardware nesting vs. software nesting;
> * - ...
> *
> * Related attributes:
> * - supported page sizes, reserved IOVA ranges (DMA mapping);
> * - vendor pgtable formats (pgtable binding);
> * - number of child IOASIDs (nesting);
> * - ...
> *
> * Above information is available only after one or more devices are
> * attached to the specified IOASID. Otherwise the IOASID is just a
> * number w/o any capability or attribute.

This feels wrong to learn most of these attributes of the IOASID after
attaching to a device.

The user should have some idea how it intends to use the IOASID when
it creates it and the rest of the system should match the intention.

For instance if the user is creating a IOASID to cover the guest GPA
with the intention of making children it should indicate this during
alloc.

If the user is intending to point a child IOASID to a guest page table
in a certain descriptor format then it should indicate it during
alloc.

device bind should fail if the device somehow isn't compatible with
the scheme the user is tring to use.

> /*
> * Map/unmap process virtual addresses to I/O virtual addresses.
> *
> * Provide VFIO type1 equivalent semantics. Start with the same
> * restriction e.g. the unmap size should match those used in the
> * original mapping call.
> *
> * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> * must be already in the preregistered list.
> *
> * Input parameters:
> * - u32 ioasid;
> * - refer to vfio_iommu_type1_dma_{un}map
> *
> * Return: 0 on success, -errno on failure.
> */
> #define IOASID_MAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 6)
> #define IOASID_UNMAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 7)

What about nested IOASIDs?

> /*
> * Create a nesting IOASID (child) on an existing IOASID (parent)
> *
> * IOASIDs can be nested together, implying that the output address
> * from one I/O page table (child) must be further translated by
> * another I/O page table (parent).
> *
> * As the child adds essentially another reference to the I/O page table
> * represented by the parent, any device attached to the child ioasid
> * must be already attached to the parent.
> *
> * In concept there is no limit on the number of the nesting levels.
> * However for the majority case one nesting level is sufficient. The
> * user should check whether an IOASID supports nesting through
> * IOASID_GET_INFO. For example, if only one nesting level is allowed,
> * the nesting capability is reported only on the parent instead of the
> * child.
> *
> * User also needs check (via IOASID_GET_INFO) whether the nesting
> * is implemented in hardware or software. If software-based, DMA
> * mapping protocol should be used on the child IOASID. Otherwise,
> * the child should be operated with pgtable binding protocol.
> *
> * Input parameters:
> * - u32 parent_ioasid;
> *
> * Return: child_ioasid on success, -errno on failure;
> */
> #define IOASID_CREATE_NESTING _IO(IOASID_TYPE, IOASID_BASE + 8)

Do you think another ioctl is best? Should this just be another
parameter to alloc?

> /*
> * Bind an user-managed I/O page table with the IOMMU
> *
> * Because user page table is untrusted, IOASID nesting must be enabled
> * for this ioasid so the kernel can enforce its DMA isolation policy
> * through the parent ioasid.
> *
> * Pgtable binding protocol is different from DMA mapping. The latter
> * has the I/O page table constructed by the kernel and updated
> * according to user MAP/UNMAP commands. With pgtable binding the
> * whole page table is created and updated by userspace, thus different
> * set of commands are required (bind, iotlb invalidation, page fault, etc.).
> *
> * Because the page table is directly walked by the IOMMU, the user
> * must use a format compatible to the underlying hardware. It can
> * check the format information through IOASID_GET_INFO.
> *
> * The page table is bound to the IOMMU according to the routing
> * information of each attached device under the specified IOASID. The
> * routing information (RID and optional PASID) is registered when a
> * device is attached to this IOASID through VFIO uAPI.
> *
> * Input parameters:
> * - child_ioasid;
> * - address of the user page table;
> * - formats (vendor, address_width, etc.);
> *
> * Return: 0 on success, -errno on failure.
> */
> #define IOASID_BIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 9)
> #define IOASID_UNBIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 10)

Also feels backwards, why wouldn't we specify this, and the required
page table format, during alloc time?

> /*
> * Bind an user-managed PASID table to the IOMMU
> *
> * This is required for platforms which place PASID table in the GPA space.
> * In this case the specified IOASID represents the per-RID PASID space.
> *
> * Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
> * special flag to indicate the difference from normal I/O address spaces.
> *
> * The format info of the PASID table is reported in IOASID_GET_INFO.
> *
> * As explained in the design section, user-managed I/O page tables must
> * be explicitly bound to the kernel even on these platforms. It allows
> * the kernel to uniformly manage I/O address spaces cross all platforms.
> * Otherwise, the iotlb invalidation and page faulting uAPI must be hacked
> * to carry device routing information to indirectly mark the hidden I/O
> * address spaces.
> *
> * Input parameters:
> * - child_ioasid;
> * - address of PASID table;
> * - formats (vendor, size, etc.);
> *
> * Return: 0 on success, -errno on failure.
> */
> #define IOASID_BIND_PASID_TABLE _IO(IOASID_TYPE, IOASID_BASE + 11)
> #define IOASID_UNBIND_PASID_TABLE _IO(IOASID_TYPE, IOASID_BASE + 12)

Ditto

>
> /*
> * Invalidate IOTLB for an user-managed I/O page table
> *
> * Unlike what's defined in include/uapi/linux/iommu.h, this command
> * doesn't allow the user to specify cache type and likely support only
> * two granularities (all, or a specified range) in the I/O address space.
> *
> * Physical IOMMU have three cache types (iotlb, dev_iotlb and pasid
> * cache). If the IOASID represents an I/O address space, the invalidation
> * always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
> * represents a vPASID space, then this command applies to the PASID
> * cache.
> *
> * Similarly this command doesn't provide IOMMU-like granularity
> * info (domain-wide, pasid-wide, range-based), since it's all about the
> * I/O address space itself. The ioasid driver walks the attached
> * routing information to match the IOMMU semantics under the
> * hood.
> *
> * Input parameters:
> * - child_ioasid;
> * - granularity
> *
> * Return: 0 on success, -errno on failure
> */
> #define IOASID_INVALIDATE_CACHE _IO(IOASID_TYPE, IOASID_BASE + 13)

This should have an IOVA range too?

> /*
> * Page fault report and response
> *
> * This is TBD. Can be added after other parts are cleared up. Likely it
> * will be a ring buffer shared between user/kernel, an eventfd to notify
> * the user and an ioctl to complete the fault.
> *
> * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
> */

Any reason not to just use read()?

>
> 2.2. /dev/vfio uAPI
> ++++++++++++++++

To be clear you mean the 'struct vfio_device' API, these are not
IOCTLs on the container or group?

> /*
> * Bind a vfio_device to the specified IOASID fd
> *
> * Multiple vfio devices can be bound to a single ioasid_fd, but a single
> * vfio device should not be bound to multiple ioasid_fd's.
> *
> * Input parameters:
> * - ioasid_fd;
> *
> * Return: 0 on success, -errno on failure.
> */
> #define VFIO_BIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 22)
> #define VFIO_UNBIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 23)

This is where it would make sense to have an output "device id" that
allows /dev/ioasid to refer to this "device" by number in events and
other related things.

>
> 2.3. KVM uAPI
> ++++++++++++
>
> /*
> * Update CPU PASID mapping
> *
> * This is necessary when ENQCMD will be used in the guest while the
> * targeted device doesn't accept the vPASID saved in the CPU MSR.
> *
> * This command allows user to set/clear the vPASID->pPASID mapping
> * in the CPU, by providing the IOASID (and FD) information representing
> * the I/O address space marked by this vPASID.
> *
> * Input parameters:
> * - user_pasid;
> * - ioasid_fd;
> * - ioasid;
> */
> #define KVM_MAP_PASID _IO(KVMIO, 0xf0)
> #define KVM_UNMAP_PASID _IO(KVMIO, 0xf1)

It seems simple enough.. So the physical PASID can only be assigned if
the user has an IOASID that points at it? Thus it is secure?

> 3. Sample structures and helper functions
>
> Three helper functions are provided to support VFIO_BIND_IOASID_FD:
>
> struct ioasid_ctx *ioasid_ctx_fdget(int fd);
> int ioasid_register_device(struct ioasid_ctx *ctx, struct ioasid_dev *dev);
> int ioasid_unregister_device(struct ioasid_dev *dev);
>
> An ioasid_ctx is created for each fd:
>
> struct ioasid_ctx {
> // a list of allocated IOASID data's
> struct list_head ioasid_list;

Would expect an xarray

> // a list of registered devices
> struct list_head dev_list;

xarray of device_id

> // a list of pre-registered virtual address ranges
> struct list_head prereg_list;

Should re-use the existing SW IOASID table, and be an interval tree.

> Each registered device is represented by ioasid_dev:
>
> struct ioasid_dev {
> struct list_head next;
> struct ioasid_ctx *ctx;
> // always be the physical device
> struct device *device;
> struct kref kref;
> };
>
> Because we assume one vfio_device connected to at most one ioasid_fd,
> here ioasid_dev could be embedded in vfio_device and then linked to
> ioasid_ctx->dev_list when registration succeeds. For mdev the struct
> device should be the pointer to the parent device. PASID marking this
> mdev is specified later when VFIO_ATTACH_IOASID.

Don't embed a struct like this in something with vfio_device - that
just makes a mess of reference counting by having multiple krefs in
the same memory block. Keep it as a pointer, the attach operation
should return a pointer to the above struct.

> An ioasid_data is created when IOASID_ALLOC, as the main object
> describing characteristics about an I/O page table:
>
> struct ioasid_data {
> // link to ioasid_ctx->ioasid_list
> struct list_head next;
>
> // the IOASID number
> u32 ioasid;
>
> // the handle to convey iommu operations
> // hold the pgd (TBD until discussing iommu api)
> struct iommu_domain *domain;

But at least for the first coding draft I would expect to see this API
presented with no PASID support and a simple 1:1 with iommu_domain. How
PASID gets modeled is the big TBD, right?

> ioasid_data and iommu_domain have overlapping roles as both are
> introduced to represent an I/O address space. It is still a big TBD how
> the two should be corelated or even merged, and whether new iommu
> ops are required to handle RID+PASID explicitly.

I think it is OK that the uapi and kernel api have different
structs. The uapi focused one should hold the uapi related data, which
is what you've shown here, I think.

> Two helper functions are provided to support VFIO_ATTACH_IOASID:
>
> struct attach_info {
> u32 ioasid;
> // If valid, the PASID to be used physically
> u32 pasid;
> };
> int ioasid_device_attach(struct ioasid_dev *dev,
> struct attach_info info);
> int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);

Honestly, I still prefer this to be highly explicit as this is where
all device driver authors get invovled:

ioasid_pci_device_attach(struct pci_device *pdev, struct ioasid_dev *dev, u32 ioasid);
ioasid_pci_device_pasid_attach(struct pci_device *pdev, u32 *physical_pasid, struct ioasid_dev *dev, u32 ioasid);

And presumably a variant for ARM non-PCI platform (?) devices.

This could boil down to a __ioasid_device_attach() as you've shown.

> A new object is introduced and linked to ioasid_data->attach_data for
> each successful attach operation:
>
> struct ioasid_attach_data {
> struct list_head next;
> struct ioasid_dev *dev;
> u32 pasid;
> }

This should be returned as a pointer and detatch should be:

int ioasid_device_detach(struct ioasid_attach_data *);

> As explained in the design section, there is no explicit group enforcement
> in /dev/ioasid uAPI or helper functions. But the ioasid driver does
> implicit group check - before every device within an iommu group is
> attached to this IOASID, the previously-attached devices in this group are
> put in ioasid_data->partial_devices. The IOASID rejects any command if
> the partial_devices list is not empty.

It is simple enough. Would be good to design in a diagnostic string so
userspace can make sense of the failure. Eg return something like
-EDEADLK and provide an ioctl 'why did EDEADLK happen' ?


> Then is the last helper function:
> u32 ioasid_get_global_pasid(struct ioasid_ctx *ctx,
> u32 ioasid, bool alloc);
>
> ioasid_get_global_pasid is necessary in scenarios where multiple devices
> want to share a same PASID value on the attached I/O page table (e.g.
> when ENQCMD is enabled, as explained in next section). We need a
> centralized place (ioasid_data->pasid) to hold this value (allocated when
> first called with alloc=true). vfio device driver calls this function (alloc=
> true) to get the global PASID for an ioasid before calling ioasid_device_
> attach. KVM also calls this function (alloc=false) to setup PASID translation
> structure when user calls KVM_MAP_PASID.

When/why would the VFIO driver do this? isn't this just some varient
of pasid_attach?

ioasid_pci_device_enqcmd_attach(struct pci_device *pdev, u32 *physical_pasid, struct ioasid_dev *dev, u32 ioasid);

?

> 4. PASID Virtualization
>
> When guest SVA (vSVA) is enabled, multiple GVA address spaces are
> created on the assigned vfio device. This leads to the concepts of
> "virtual PASID" (vPASID) vs. "physical PASID" (pPASID). vPASID is assigned
> by the guest to mark an GVA address space while pPASID is the one
> selected by the host and actually routed in the wire.
>
> vPASID is conveyed to the kernel when user calls VFIO_ATTACH_IOASID.

Should the vPASID programmed into the IOASID before calling
VFIO_ATTACH_IOASID?

> vfio device driver translates vPASID to pPASID before calling ioasid_attach_
> device, with two factors to be considered:
>
> - Whether vPASID is directly used (vPASID==pPASID) in the wire, or
> should be instead converted to a newly-allocated one (vPASID!=
> pPASID);
>
> - If vPASID!=pPASID, whether pPASID is allocated from per-RID PASID
> space or a global PASID space (implying sharing pPASID cross devices,
> e.g. when supporting Intel ENQCMD which puts PASID in a CPU MSR
> as part of the process context);

This whole section is 4 really confusing. I think it would be more
understandable to focus on the list below and minimize the vPASID

> The actual policy depends on pdev vs. mdev, and whether ENQCMD is
> supported. There are three possible scenarios:
>
> (Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization
> policies.)

This has become unclear. I think this should start by identifying the
6 main type of devices and how they can use pPASID/vPASID:

0) Device is a RID and cannot issue PASID
1) Device is a mdev and cannot issue PASID
2) Device is a mdev and programs a single fixed PASID during bind,
does not accept PASID from the guest

3) Device accepts any PASIDs from the guest. No
vPASID/pPASID translation is possible. (classic vfio_pci)
4) Device accepts any PASID from the guest and has an
internal vPASID/pPASID translation (enhanced vfio_pci)
5) Device accepts and PASID from the guest and relys on
external vPASID/pPASID translation via ENQCMD (Intel SIOV mdev)

0-2 don't use vPASID at all

3-5 consume a vPASID but handle it differently.

I think the 3-5 map into what you are trying to explain in the table
below, which is the rules for allocating the vPASID depending on which
of device types 3-5 are present and or mixed.

For instance device type 3 requires vPASID == pPASID because it can't
do translation at all.

This probably all needs to come through clearly in the /dev/ioasid
interface. Once the attached devices are labled it would make sense to
have a 'query device' /dev/ioasid IOCTL to report the details based on
how the device attached and other information.

> 2) mdev: vPASID!=pPASID (per-RID if w/o ENQCMD, otherwise global)
>
> PASIDs are also used by kernel to mark the default I/O address space
> for mdev, thus cannot be delegated to the guest. Instead, the mdev
> driver must allocate a new pPASID for each vPASID (thus vPASID!=
> pPASID) and then use pPASID when attaching this mdev to an ioasid.

I don't understand this at all.. What does "PASIDs are also used by
the kernel" mean?

> The mdev driver needs cache the PASID mapping so in mediation
> path vPASID programmed by the guest can be converted to pPASID
> before updating the physical MMIO register.

This is my scenario #4 above. Device and internally virtualize
vPASID/pPASID - how that is done is up to the device. But this is all
just labels, when such a device attaches, it should use some specific
API:

ioasid_pci_device_vpasid_attach(struct pci_device *pdev,
u32 *physical_pasid, u32 *virtual_pasid, struct ioasid_dev *dev, u32 ioasid);

And then maintain its internal translation

> In previous thread a PASID range split scheme was discussed to support
> this combination, but we haven't worked out a clean uAPI design yet.
> Therefore in this proposal we decide to not support it, implying the
> user should have some intelligence to avoid such scenario. It could be
> a TODO task for future.

It really just boils down to how to allocate the PASIDs to get around
the bad viommu interface that assumes all PASIDs are usable by all
devices.

> In spite of those subtle considerations, the kernel implementation could
> start simple, e.g.:
>
> - v==p for pdev;
> - v!=p and always use a global PASID pool for all mdev's;

Regardless all this mess needs to be hidden from the consuming drivers
with some simple APIs as above. The driver should indicate what its HW
can do and the PASID #'s that magically come out of /dev/ioasid should
be appropriate.

Will resume on another email..

Jason