Re: [PATCH v2 1/9] docs: Document IO Address Space ID (IOASID) APIs

From: Jacob Pan
Date: Tue Sep 01 2020 - 12:49:46 EST


Hi Eric,

On Thu, 27 Aug 2020 18:21:07 +0200
Auger Eric <eric.auger@xxxxxxxxxx> wrote:

> Hi Jacob,
> On 8/24/20 12:32 PM, Jean-Philippe Brucker wrote:
> > On Fri, Aug 21, 2020 at 09:35:10PM -0700, Jacob Pan wrote:
> >> IOASID is used to identify address spaces that can be targeted by
> >> device DMA. It is a system-wide resource that is essential to its
> >> many users. This document is an attempt to help developers from
> >> all vendors navigate the APIs. At this time, ARM SMMU and Intel’s
> >> Scalable IO Virtualization (SIOV) enabled platforms are the
> >> primary users of IOASID. Examples of how SIOV components interact
> >> with IOASID APIs are provided in that many APIs are driven by the
> >> requirements from SIOV.
> >>
> >> Signed-off-by: Liu Yi L <yi.l.liu@xxxxxxxxx>
> >> Signed-off-by: Wu Hao <hao.wu@xxxxxxxxx>
> >> Signed-off-by: Jacob Pan <jacob.jun.pan@xxxxxxxxxxxxxxx>
> >> ---
> >> Documentation/ioasid.rst | 618
> >> +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed,
> >> 618 insertions(+) create mode 100644 Documentation/ioasid.rst
> >>
> >> diff --git a/Documentation/ioasid.rst b/Documentation/ioasid.rst
> >
> > Thanks for writing this up. Should it go to
> > Documentation/driver-api/, or Documentation/driver-api/iommu/? I
> > think this also needs to Cc linux-doc@xxxxxxxxxxxxxxx and
> > corbet@xxxxxxx
> >> new file mode 100644
> >> index 000000000000..b6a8cdc885ff
> >> --- /dev/null
> >> +++ b/Documentation/ioasid.rst
> >> @@ -0,0 +1,618 @@
> >> +.. ioasid:
> >> +
> >> +=====================================
> >> +IO Address Space ID
> >> +=====================================
> >> +
> >> +IOASID is a generic name for PCIe Process Address ID (PASID) or
> >> ARM +SMMU sub-stream ID. An IOASID identifies an address space
> >> that DMA
> >
> > "SubstreamID"
> On ARM if we don't use PASIDs we have streamids (SID) which can also
> identify address spaces that DMA requests can target. So maybe this
> definition is not sufficient.
>
According to SMMU spec, the SubstreamID is equivalent to PASID. My
understanding is that SID is equivalent to PCI requester ID that
identifies stage 2. Do you plan to use IOASID for stage 2?
IOASID is mostly for SVA and DMA request w/ PASID.

> >
> >> +requests can target.
> >> +
> >> +The primary use cases for IOASID are Shared Virtual Address (SVA)
> >> and +IO Virtual Address (IOVA). However, the requirements for
> >> IOASID
> >
> > IOVA alone isn't a use case, maybe "multiple IOVA spaces per
> > device"?
> >> +management can vary among hardware architectures.
> >> +
> >> +This document covers the generic features supported by IOASID
> >> +APIs. Vendor-specific use cases are also illustrated with Intel's
> >> VT-d +based platforms as the first example.
> >> +
> >> +.. contents:: :local:
> >> +
> >> +Glossary
> >> +========
> >> +PASID - Process Address Space ID
> >> +
> >> +IOASID - IO Address Space ID (generic term for PCIe PASID and
> >> +sub-stream ID in SMMU)
> >
> > "SubstreamID"
> >
> >> +
> >> +SVA/SVM - Shared Virtual Addressing/Memory
> >> +
> >> +ENQCMD - New Intel X86 ISA for efficient workqueue submission
> >> [1]
> >
> > Maybe drop the "New", to keep the documentation perennial. It might
> > be good to add internal links here to the specifications URLs at
> > the bottom.
> >> +
> >> +DSA - Intel Data Streaming Accelerator [2]
> >> +
> >> +VDCM - Virtual device composition module [3]
> >> +
> >> +SIOV - Intel Scalable IO Virtualization
> >> +
> >> +
> >> +Key Concepts
> >> +============
> >> +
> >> +IOASID Set
> >> +-----------
> >> +An IOASID set is a group of IOASIDs allocated from the system-wide
> >> +IOASID pool. An IOASID set is created and can be identified by a
> >> +token of u64. Refer to IOASID set APIs for more details.
> >
> > Identified either by an u64 or an mm_struct, right? Maybe just
> > drop the second sentence if it's detailed in the IOASID set section
> > below.
> >> +
> >> +IOASID set is particularly useful for guest SVA where each guest
> >> could +have its own IOASID set for security and efficiency reasons.
> >> +
> >> +IOASID Set Private ID (SPID)
> >> +----------------------------
> >> +SPIDs are introduced as IOASIDs within its set. Each SPID maps to
> >> a +system-wide IOASID but the namespace of SPID is within its
> >> IOASID +set.
> >
> > The intro isn't super clear. Perhaps this is simpler:
> > "Each IOASID set has a private namespace of SPIDs. An SPID maps to a
> > single system-wide IOASID."
> or, "within an ioasid set, each ioasid can be associated with an alias
> ID, named SPID."
I don't have strong opinion, I feel it is good to explain the
relationship between SPID and IOASID in both directions, how about add?
" Conversely, each IOASID is associated with an alias ID, named SPID."

> >
> >> SPIDs can be used as guest IOASIDs where each guest could do
> >> +IOASID allocation from its own pool and map them to host physical
> >> +IOASIDs. SPIDs are particularly useful for supporting live
> >> migration +where decoupling guest and host physical resources are
> >> necessary. +
> >> +For example, two VMs can both allocate guest PASID/SPID #101 but
> >> map to +different host PASIDs #201 and #202 respectively as shown
> >> in the +diagram below.
> >> +::
> >> +
> >> + .------------------. .------------------.
> >> + | VM 1 | | VM 2 |
> >> + | | | |
> >> + |------------------| |------------------|
> >> + | GPASID/SPID 101 | | GPASID/SPID 101 |
> >> + '------------------' -------------------' Guest
> >> + __________|______________________|______________________
> >> + | | Host
> >> + v v
> >> + .------------------. .------------------.
> >> + | Host IOASID 201 | | Host IOASID 202 |
> >> + '------------------' '------------------'
> >> + | IOASID set 1 | | IOASID set 2 |
> >> + '------------------' '------------------'
> >> +
> >> +Guest PASID is treated as IOASID set private ID (SPID) within an
> >> +IOASID set, mappings between guest and host IOASIDs are stored in
> >> the +set for inquiry.
> >> +
> >> +IOASID APIs
> >> +===========
> >> +To get the IOASID APIs, users must #include <linux/ioasid.h>.
> >> These APIs +serve the following functionalities:
> >> +
> >> + - IOASID allocation/Free
> >> + - Group management in the form of ioasid_set
> >> + - Private data storage and lookup
> >> + - Reference counting
> >> + - Event notification in case of state change
> (a)
got it

> >> +
> >> +IOASID Set Level APIs
> >> +--------------------------
> >> +For use cases such as guest SVA it is necessary to manage IOASIDs
> >> at +a group level. For example, VMs may allocate multiple IOASIDs
> >> for
> I would use the introduced ioasid_set terminology instead of "group".
Right, we already introduced it.

> >> +guest process address sharing (vSVA). It is imperative to enforce
> >> +VM-IOASID ownership such that malicious guest cannot target DMA
> >
> > "a malicious guest"
> >
> >> +traffic outside its own IOASIDs, or free an active IOASID belong
> >> to
> >
> > "that belongs to"
> >
> >> +another VM.
> >> +::
> >> +
> >> + struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota,
> >> u32 type)
> what is this void *token? also the type may be explained here.
token is explained in the text following API list. I can move it up.

> >> +
> >> + int ioasid_adjust_set(struct ioasid_set *set, int quota);
> >
> > These could be named "ioasid_set_alloc" and "ioasid_set_adjust" to
> > be consistent with the rest of the API.
> >
> >> +
> >> + void ioasid_set_get(struct ioasid_set *set)
> >> +
> >> + void ioasid_set_put(struct ioasid_set *set)
> >> +
> >> + void ioasid_set_get_locked(struct ioasid_set *set)
> >> +
> >> + void ioasid_set_put_locked(struct ioasid_set *set)
> >> +
> >> + int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> >
> > Might be nicer to keep the same argument names within the API. Here
> > "set" rather than "sdata".
> >
> >> + void (*fn)(ioasid_t id, void
> >> *data),
> >> + void *data)
> >
> > (alignment)
> >
> >> +
> >> +
> >> +IOASID set concept is introduced to represent such IOASID groups.
> >> Each
> >
> > Or just "IOASID sets represent such IOASID groups", but might be
> > redundant.
> >
> >> +IOASID set is created with a token which can be one of the
> >> following +types:
> I think this explanation should happen before the above function
> prototypes
ditto.

> >> +
> >> + - IOASID_SET_TYPE_NULL (Arbitrary u64 value)
> >> + - IOASID_SET_TYPE_MM (Set token is a mm_struct)
> >> +
> >> +The explicit MM token type is useful when multiple users of an
> >> IOASID +set under the same process need to communicate about their
> >> shared IOASIDs. +E.g. An IOASID set created by VFIO for one guest
> >> can be associated +with the KVM instance for the same guest since
> >> they share a common mm_struct. +
> >> +The IOASID set APIs serve the following purposes:
> >> +
> >> + - Ownership/permission enforcement
> >> + - Take collective actions, e.g. free an entire set
> >> + - Event notifications within a set
> >> + - Look up a set based on token
> >> + - Quota enforcement
> >
> > This paragraph could be earlier in the section
>
> yes this is a kind of repetition of (a), above
I meant to highlight on what the APIs do such that readers don't
need to read the code instead.

> >
> >> +
> >> +Individual IOASID APIs
> >> +----------------------
> >> +Once an ioasid_set is created, IOASIDs can be allocated from the
> >> set. +Within the IOASID set namespace, set private ID (SPID) is
> >> supported. In +the VM use case, SPID can be used for storing guest
> >> PASID. +
> >> +::
> >> +
> >> + ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
> >> ioasid_t max,
> >> + void *private);
> >> +
> >> + int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
> >> +
> >> + void ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
> >> +
> >> + int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
> >> +
> >> + void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
> >> +
> >> + void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> >> + bool (*getter)(void *));
> >> +
> >> + ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t
> >> spid) +
> >> + int ioasid_attach_data(struct ioasid_set *set, ioasid_t ioasid,
> >> + void *data);
> >> + int ioasid_attach_spid(struct ioasid_set *set, ioasid_t ioasid,
> >> + ioasid_t ssid);
> >
> > s/ssid/spid>
got it

> >> +
> >> +
> >> +Notifications
> >> +-------------
> >> +An IOASID may have multiple users, each user may have hardware
> >> context +associated with an IOASID. When the status of an IOASID
> >> changes, +e.g. an IOASID is being freed, users need to be notified
> >> such that the +associated hardware context can be cleared,
> >> flushed, and drained. +
> >> +::
> >> +
> >> + int ioasid_register_notifier(struct ioasid_set *set, struct
> >> + notifier_block *nb)
> >> +
> >> + void ioasid_unregister_notifier(struct ioasid_set *set,
> >> + struct notifier_block *nb)
> >> +
> >> + int ioasid_register_notifier_mm(struct mm_struct *mm, struct
> >> + notifier_block *nb)
> >> +
> >> + void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct
> >> + notifier_block *nb)
> the mm_struct prototypes may be justified
This is the mm type token, i.e.
- IOASID_SET_TYPE_MM (Set token is a mm_struct)
I am not sure if it is better to keep the explanation in code or in
this document, certainly don't want to duplicate.

> >> +
> >> + int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd,
> >> + unsigned int flags)
> this one is not obvious either.
Here I just wanted to list the API functions, perhaps readers can check
out the code comments?

> >> +
> >> +
> >> +Events
> >> +~~~~~~
> >> +Notification events are pertinent to individual IOASIDs, they can
> >> be +one of the following:
> >> +
> >> + - ALLOC
> >> + - FREE
> >> + - BIND
> >> + - UNBIND
> >> +
> >> +Ordering
> >> +~~~~~~~~
> >> +Ordering is supported by IOASID notification priorities as the
> >> +following (in ascending order):
> >> +
> >> +::
> >> +
> >> + enum ioasid_notifier_prios {
> >> + IOASID_PRIO_LAST,
> >> + IOASID_PRIO_IOMMU,
> >> + IOASID_PRIO_DEVICE,
> >> + IOASID_PRIO_CPU,
> >> + };
>
> Maybe:
> when registered, notifiers are assigned a priority that affect the
> call order. Notifiers with CPU priority get called before notifiers
> with device priority and so on.
Sounds good.

> >> +
> >> +The typical use case is when an IOASID is freed due to an
> >> exception, DMA +source should be quiesced before tearing down
> >> other hardware contexts +in the system. This will reduce the churn
> >> in handling faults. DMA work +submission is performed by the CPU
> >> which is granted higher priority than +devices.
> >> +
> >> +
> >> +Scopes
> >> +~~~~~~
> >> +There are two types of notifiers in IOASID core: system-wide and
> >> +ioasid_set-wide.
> >> +
> >> +System-wide notifier is catering for users that need to handle all
> >> +IOASIDs in the system. E.g. The IOMMU driver handles all IOASIDs.
> >> +
> >> +Per ioasid_set notifier can be used by VM specific components
> >> such as +KVM. After all, each KVM instance only cares about
> >> IOASIDs within its +own set.
> >> +
> >> +
> >> +Atomicity
> >> +~~~~~~~~~
> >> +IOASID notifiers are atomic due to spinlocks used inside the
> >> IOASID +core. For tasks cannot be completed in the notifier
> >> handler, async work
> >
> > "tasks that cannot be"
> >
> >> +can be submitted to complete the work later as long as there is no
> >> +ordering requirement.
> >> +
> >> +Reference counting
> >> +------------------
> >> +IOASID lifecycle management is based on reference counting. Users
> >> of +IOASID intend to align lifecycle with the IOASID need to hold
> >
> > "who intend to"
> >
> >> +reference of the IOASID. IOASID will not be returned to the pool
> >> for
> >
> > "a reference to the IOASID. The IOASID"
> >
> >> +allocation until all references are dropped. Calling ioasid_free()
> >> +will mark the IOASID as FREE_PENDING if the IOASID has outstanding
> >> +reference. ioasid_get() is not allowed once an IOASID is in the
> >> +FREE_PENDING state.
> >> +
> >> +Event notifications are used to inform users of IOASID status
> >> change. +IOASID_FREE event prompts users to drop their references
> >> after +clearing its context.
> >> +
> >> +For example, on VT-d platform when an IOASID is freed, teardown
> >> +actions are performed on KVM, device driver, and IOMMU driver.
> >> +KVM shall register notifier block with::
> >> +
> >> + static struct notifier_block pasid_nb_kvm = {
> >> + .notifier_call = pasid_status_change_kvm,
> >> + .priority = IOASID_PRIO_CPU,
> >> + };
> >> +
> >> +VDCM driver shall register notifier block with::
> >> +
> >> + static struct notifier_block pasid_nb_vdcm = {
> >> + .notifier_call = pasid_status_change_vdcm,
> >> + .priority = IOASID_PRIO_DEVICE,
> >> + };
> not sure those code snippets are really useful. Maybe simply say who
> is supposed to use each prio.
Agreed, not all the bits in the snippets are explained. I will explain
KVM and VDCM need to use priority to ensure call order.

> >> +
> >> +In both cases, notifier blocks shall be registered on the IOASID
> >> set +such that *only* events from the matching VM is received.
> >> +
> >> +If KVM attempts to register notifier block before the IOASID set
> >> is +created for the MM token, the notifier block will be placed on
> >> a
> using the MM token
sounds good

> >> +pending list inside IOASID core. Once the token matching IOASID
> >> set +is created, IOASID will register the notifier block
> >> automatically.
> Is this implementation mandated? Can't you enforce the ioasid_set to
> be created before the notifier gets registered?
> >> +IOASID core does not replay events for the existing IOASIDs in the
> >> +set. For IOASID set of MM type, notification blocks can be
> >> registered +on empty sets only. This is to avoid lost events.
> >> +
> >> +IOMMU driver shall register notifier block on global chain::
> >> +
> >> + static struct notifier_block pasid_nb_vtd = {
> >> + .notifier_call = pasid_status_change_vtd,
> >> + .priority = IOASID_PRIO_IOMMU,
> >> + };
> >> +
> >> +Custom allocator APIs
> >> +---------------------
> >> +
> >> +::
> >> +
> >> + int ioasid_register_allocator(struct ioasid_allocator_ops
> >> *allocator); +
> >> + void ioasid_unregister_allocator(struct ioasid_allocator_ops
> >> *allocator); +
> >> +Allocator Choices
> >> +~~~~~~~~~~~~~~~~~
> >> +IOASIDs are allocated for both host and guest SVA/IOVA usage.
> >> However, +allocators can be different. For example, on VT-d guest
> >> PASID +allocation must be performed via a virtual command
> >> interface which is +emulated by VMM.
> >> +
> >> +IOASID core has the notion of "custom allocator" such that guest
> >> can +register virtual command allocator that precedes the default
> >> one. +
> >> +Namespaces
> >> +~~~~~~~~~~
> >> +IOASIDs are limited system resources that default to 20 bits in
> >> +size. Since each device has its own table, theoretically the
> >> namespace +can be per device also. However, for security reasons
> >> sharing PASID +tables among devices are not good for isolation.
> >> Therefore, IOASID +namespace is system-wide.
> >
> > I don't follow this development. Having per-device PASID table
> > would work fine for isolation (assuming no hardware bug
> > necessitating IOMMU groups). If I remember correctly IOASID space
> > was chosen to be OS-wide because it simplifies the management code
> > (single PASID per task), and it is system-wide across VMs only in
> > the case of VT-d scalable mode.
> >> +
> >> +There are also other reasons to have this simpler system-wide
> >> +namespace. Take VT-d as an example, VT-d supports shared workqueue
> >> +and ENQCMD[1] where one IOASID could be used to submit work on
> >
> > Maybe use the Sphinx glossary syntax rather than "[1]"
> > https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html#glossary-directive
> >
> >> +multiple devices that are shared with other VMs. This requires
> >> IOASID +to be system-wide. This is also the reason why guests must
> >> use an +emulated virtual command interface to allocate IOASID from
> >> the host. +
> >> +
> >> +Life cycle
> >> +==========
> >> +This section covers IOASID lifecycle management for both
> >> bare-metal +and guest usages. In bare-metal SVA, MMU notifier is
> >> directly hooked +up with IOMMU driver, therefore the process
> >> address space (MM) +lifecycle is aligned with IOASID.
> therefore the IOASID lifecyle matches the process address space (MM)
> lifecyle?
Sounds good.

> >> +
> >> +However, guest MMU notifier is not available to host IOMMU
> >> driver,
> the guest MMU notifier
> >> +when guest MM terminates unexpectedly, the events have to go
> >> through
> the guest MM
> >> +VFIO and IOMMU UAPI to reach host IOMMU driver. There are also
> >> more +parties involved in guest SVA, e.g. on Intel VT-d platform,
> >> IOASIDs +are used by IOMMU driver, KVM, VDCM, and VFIO.
> >> +
> >> +Native IOASID Life Cycle (VT-d Example)
> >> +---------------------------------------
> >> +
> >> +The normal flow of native SVA code with Intel Data Streaming
> >> +Accelerator(DSA) [2] as example:
> >> +
> >> +1. Host user opens accelerator FD, e.g. DSA driver, or uacce;
> >> +2. DSA driver allocate WQ, do sva_bind_device();
> >> +3. IOMMU driver calls ioasid_alloc(), then bind PASID with device,
> >> + mmu_notifier_get()
> >> +4. DMA starts by DSA driver userspace
> >> +5. DSA userspace close FD
> >> +6. DSA/uacce kernel driver handles FD.close()
> >> +7. DSA driver stops DMA
> >> +8. DSA driver calls sva_unbind_device();
> >> +9. IOMMU driver does unbind, clears PASID context in IOMMU, flush
> >> + TLBs. mmu_notifier_put() called.
> >> +10. mmu_notifier.release() called, IOMMU SVA code calls
> >> ioasid_free()* +11. The IOASID is returned to the pool, reclaimed.
> >> +
> >> +::
> >> +
> >
> > Use a footnote?
> > https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#footnotes
> >> + * With ENQCMD, PASID used on VT-d is not released in
> >> mmu_notifier() but
> >> + mmdrop(). mmdrop comes after FD close. Should not matter.
> >
> > "comes after FD close, which doesn't make a difference?"
> > The following might not be necessary since early process
> > termination is described later.
> >
> >> + If the user process dies unexpectedly, Step #10 may come
> >> before
> >> + Step #5, in between, all DMA faults discarded. PRQ responded
> >> with
> >
> > PRQ hasn't been defined in this document.
> >
> >> + code INVALID REQUEST.
> >> +
> >> +During the normal teardown, the following three steps would
> >> happen in +order:
> can't this be illustrated in the above 1-11 sequence, just adding
> NORMAL TEARDONW before #7?
> >> +
> >> +1. Device driver stops DMA request
> >> +2. IOMMU driver unbinds PASID and mm, flush all TLBs, drain
> >> in-flight
> >> + requests.
> >> +3. IOASID freed
> >> +
> Then you can just focus on abnormal termination
Yes, will refer to the steps starting #7. These can be removed.

> >> +Exception happens when process terminates *before* device driver
> >> stops +DMA and call IOMMU driver to unbind. The flow of process
> >> exists are as
> Can't this be explained with something simpler looking at the steps
> 1-11?
It meant to be educational given this level of details. Simpler
steps are labeled with (1) (2) (3). Perhaps these labels didn't stand
out right? I will use the steps in the 1-11 sequence.

> >
> > "exits"
> >
> >> +follows:
> >> +
> >> +::
> >> +
> >> + do_exit() {
> >> + exit_mm() {
> >> + mm_put();
> >> + exit_mmap() {
> >> + intel_invalidate_range() //mmu notifier
> >> + tlb_finish_mmu()
> >> + mmu_notifier_release(mm) {
> >> + intel_iommu_release() {
> >> + [2]
> >> intel_iommu_teardown_pasid();
> >
> > Parentheses might be better than square brackets for step numbers
> >
> >> + intel_iommu_flush_tlbs();
> >> + }
> >> + // tlb_invalidate_range cb removed
> >> + }
> >> + unmap_vmas();
> >> + free_pgtables(); // IOMMU cannot walk PGT
> >> after this
> >> + };
> >> + }
> >> + exit_files(tsk) {
> >> + close_files() {
> >> + dsa_close();
> >> + [1] dsa_stop_dma();
> >> + intel_svm_unbind_pasid(); //nothing to do
> >> + }
> >> + }
> >> + }
> >> +
> >> + mmdrop() /* some random time later, lazy mm user */ {
> >> + mm_free_pgd();
> >> + destroy_context(mm); {
> >> + [3] ioasid_free();
> >> + }
> >> + }
> >> +
> >> +As shown in the list above, step #2 could happen before
> >> +#1. Unrecoverable(UR) faults could happen between #2 and #1.
> >> +
> >> +Also notice that TLB invalidation occurs at mmu_notifier
> >> +invalidate_range callback as well as the release callback. The
> >> reason +is that release callback will delete IOMMU driver from the
> >> notifier +chain which may skip invalidate_range() calls during the
> >> exit path. +
> >> +To avoid unnecessary reporting of UR fault, IOMMU driver shall
> >> disable
> UR?
Unrecoverable, mentioned in the previous paragraph.

> >> +fault reporting after free and before unbind.
> >> +
> >> +Guest IOASID Life Cycle (VT-d Example)
> >> +--------------------------------------
> >> +Guest IOASID life cycle starts with guest driver open(), this
> >> could be +uacce or individual accelerator driver such as DSA. At
> >> FD open, +sva_bind_device() is called which triggers a series of
> >> actions. +
> >> +The example below is an illustration of *normal* operations that
> >> +involves *all* the SW components in VT-d. The flow can be simpler
> >> if +no ENQCMD is supported.
> >> +
> >> +::
> >> +
> >> + VFIO IOMMU KVM VDCM IOASID
> >> Ref
> >> + ..................................................................
> >> + 1 ioasid_register_notifier/_mm()
> >> + 2 ioasid_alloc()
> >> 1
> >> + 3 bind_gpasid()
> >> + 4 iommu_bind()->ioasid_get()
> >> 2
> >> + 5 ioasid_notify(BIND)
> >> + 6 -> ioasid_get()
> >> 3
> >> + 7 -> vmcs_update_atomic()
> >> + 8 mdev_write(gpasid)
> >> + 9 hpasid=
> >> + 10 find_by_spid(gpasid)
> >> 4
> >> + 11 vdev_write(hpasid)
> >> + 12 -------- GUEST STARTS DMA --------------------------
> >> + 13 -------- GUEST STOPS DMA --------------------------
> >> + 14 mdev_clear(gpasid)
> >> + 15 vdev_clear(hpasid)
> >> + 16
> >> ioasid_put() 3
> >> + 17 unbind_gpasid()
> >> + 18 iommu_ubind()
> >> + 19 ioasid_notify(UNBIND)
> >> + 20 -> vmcs_update_atomic()
> >> + 21 ->
> >> ioasid_put() 2
> >> + 22
> >> ioasid_free() 1
> >> + 23
> >> ioasid_put() 0
> >> + 24 Reclaimed
> >> + -------------- New Life Cycle Begin
> >> ----------------------------
> >> + 1 ioasid_alloc()
> >> -> 1 +
> >> + Note: IOASID Notification Events: FREE, BIND, UNBIND
> >> +
> >> +Exception cases arise when a guest crashes or a malicious guest
> >> +attempts to cause disruption on the host system. The fault
> >> handling +rules are:
> >> +
> >> +1. IOASID free must *always* succeed.
> >> +2. An inactive period may be required before the freed IOASID is
> >> + reclaimed. During this period, consumers of IOASID perform
> >> cleanup. +3. Malfunction is limited to the guest owned resources
> >> for all
> >> + programming errors.
> >> +
> >> +The primary source of exception is when the following are out of
> >> +order:
> >> +
> >> +1. Start/Stop of DMA activity
> >> + (Guest device driver, mdev via VFIO)
> please explain the meaning of what is inside (): initiator?
> >> +2. Setup/Teardown of IOMMU PASID context, IOTLB, DevTLB flushes
> >> + (Host IOMMU driver bind/unbind)
> >> +3. Setup/Teardown of VMCS PASID translation table entries (KVM) in
> >> + case of ENQCMD
> >> +4. Programming/Clearing host PASID in VDCM (Host VDCM driver)
> >> +5. IOASID alloc/free (Host IOASID)
> >> +
> >> +VFIO is the *only* user-kernel interface, which is ultimately
> >> +responsible for exception handlings.
> >
> > "handling"
> >
> >> +
> >> +#1 is processed the same way as the assigned device today based on
> >> +device file descriptors and events. There is no special handling.
> >> +
> >> +#3 is based on bind/unbind events emitted by #2.
> >> +
> >> +#4 is naturally aligned with IOASID life cycle in that an illegal
> >> +guest PASID programming would fail in obtaining reference of the
> >> +matching host IOASID.
> >> +
> >> +#5 is similar to #4. The fault will be reported to the user if
> >> PASID +used in the ENQCMD is not set up in VMCS PASID translation
> >> table. +
> >> +Therefore, the remaining out of order problem is between #2 and
> >> +#5. I.e. unbind vs. free. More specifically, free before unbind.
> >> +
> >> +IOASID notifier and refcounting are used to ensure order.
> >> Following +a publisher-subscriber pattern where:
> with the following actors:
> >> +
> >> +- Publishers: VFIO & IOMMU
> >> +- Subscribers: KVM, VDCM, IOMMU
> this may be introduced before.
> >> +
> >> +IOASID notifier is atomic which requires subscribers to do quick
> >> +handling of the event in the atomic context. Workqueue can be
> >> used for +any processing that requires thread context.
> repetition of what was said before.
> IOASID reference must be
Right, will remove.

> >> +acquired before receiving the FREE event. The reference must be
> >> +dropped at the end of the processing in order to return the
> >> IOASID to +the pool.
> >> +
> >> +Let's examine the IOASID life cycle again when free happens
> >> *before* +unbind. This could be a result of misbehaving guests or
> >> crash. Assuming +VFIO cannot enforce unbind->free order. Notice
> >> that the setup part up +until step #12 is identical to the normal
> >> case, the flow below starts +with step 13.
> >> +
> >> +::
> >> +
> >> + VFIO IOMMU KVM VDCM IOASID
> >> Ref
> >> + ..................................................................
> >> + 13 -------- GUEST STARTS DMA --------------------------
> >> + 14 -------- *GUEST MISBEHAVES!!!* ----------------
> >> + 15 ioasid_free()
> >> + 16
> >> ioasid_notify(FREE)
> >> + 17
> >> mark_ioasid_inactive[1]
> >> + 18 kvm_nb_handler(FREE)
> >> + 19 vmcs_update_atomic()
> >> + 20 ioasid_put_locked() ->
> >> 3
> >> + 21 vdcm_nb_handler(FREE)
> >> + 22 iomm_nb_handler(FREE)
> >> + 23 ioasid_free() returns[2] schedule_work()
> >> 2
> >> + 24 schedule_work() vdev_clear_wk(hpasid)
> >> + 25 teardown_pasid_wk()
> >> + 26 ioasid_put() ->
> >> 1
> >> + 27 ioasid_put()
> >> 0
> >> + 28 Reclaimed
> >> + 29 unbind_gpasid()
> >> + 30 iommu_unbind()->ioasid_find() Fails[3]
> >> + -------------- New Life Cycle Begin
> >> ---------------------------- +
> >> +Note:
> >> +
> >> +1. By marking IOASID inactive at step #17, no new references can
> >> be
> >
> > Is "inactive" FREE_PENDING?
> >
> >> + held. ioasid_get/find() will return -ENOENT;
> >> +2. After step #23, all events can go out of order. Shall not
> >> affect
> >> + the outcome.
> >> +3. IOMMU driver fails to find private data for unbinding. If
> >> unbind is
> >> + called after the same IOASID is allocated for the same guest
> >> again,
> >> + this is a programming error. The damage is limited to the guest
> >> + itself since unbind performs permission checking based on the
> >> + IOASID set associated with the guest process.
> >> +
> >> +KVM PASID Translation Table Updates
> >> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >> +Per VM PASID translation table is maintained by KVM in order to
> >> +support ENQCMD in the guest. The table contains host-guest PASID
> >> +translations to be consumed by CPU ucode. The synchronization of
> >> the +PASID states depends on VFIO/IOMMU driver, where IOCTL and
> >> atomic +notifiers are used. KVM must register IOASID notifier per
> >> VM instance +during launch time. The following events are handled:
> >> +
> >> +1. BIND/UNBIND
> >> +2. FREE
> >> +
> >> +Rules:
> >> +
> >> +1. Multiple devices can bind with the same PASID, this can be
> >> different PCI
> >> + devices or mdevs within the same PCI device. However, only the
> >> + *first* BIND and *last* UNBIND emit notifications.
> >> +2. IOASID code is responsible for ensuring the correctness of H-G
> >> + PASID mapping. There is no need for KVM to validate the
> >> + notification data.
> >> +3. When UNBIND happens *after* FREE, KVM will see error in
> >> + ioasid_get() even when the reclaim is not done. IOMMU driver
> >> will
> >> + also avoid sending UNBIND if the PASID is already FREE.
> >> +4. When KVM terminates *before* FREE & UNBIND, references will be
> >> + dropped for all host PASIDs.
> >> +
> >> +VDCM PASID Programming
> >> +~~~~~~~~~~~~~~~~~~~~~~
> >> +VDCM composes virtual devices and exposes them to the guests. When
> >> +the guest allocates a PASID then program it to the virtual
> >> device, VDCM
> programs as well
> >> +intercepts the programming attempt then program the matching
> >> host
> >
> > "programs"
> >
> > Thanks,
> > Jean
> >
> >> +PASID on to the hardware.
> >> +Conversely, when a device is going away, VDCM must be informed
> >> such +that PASID context on the hardware can be cleared. There
> >> could be +multiple mdevs assigned to different guests in the same
> >> VDCM. Since +the PASID table is shared at PCI device level, lazy
> >> clearing is not +secure. A malicious guest can attack by using
> >> newly freed PASIDs that +are allocated by another guest.
> >> +
> >> +By holding a reference of the PASID until VDCM cleans up the HW
> >> context, +it is guaranteed that PASID life cycles do not cross
> >> within the same +device.
> >> +
> >> +
> >> +Reference
> >> +====================================================
> >> +1.
> >> https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
> >> + +2.
> >> https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
> >> + +3.
> >> https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
> >> -- 2.7.4
>
> Thanks
>
> Eric
> >>
> >
>
> _______________________________________________
> iommu mailing list
> iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
[Jacob Pan]