Re: RFC: Device isolation infrastructure

From: Alex Williamson
Date: Thu Dec 08 2011 - 00:52:30 EST


On Thu, 2011-12-08 at 10:41 +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2011-12-07 at 12:45 -0700, Alex Williamson wrote:
>
> > Ok, let's back up from the patch because I think it's clouding the driver
> > model with a detached device state that might be detracting from the
> > useful bits.
> >
> > An isolation group is a generic association of devices into a set based
> > on, well... isolation. Presumably something like an iommu driver does a
> > bus_register_notifier() and bus_for_each_dev() and when a new device is
> > added we do isolation_group_new() and isolation_group_dev_add() as
> > necessary. When the iommu driver needs to do a dma_ops call for a
> > device, it can traverse up to the isolation group and find something
> > that points it to the shared iommu context for that group. While
> > dma_ops can share the iommu context among the drivers attached to the
> > various devices, iommu_ops wants a single meta-driver for the group
> > because it's predominantly designed for use when devices are exposed
> > directly to the user. We trust native kernel drivers to share context
> > between each other, we do not trust user level drivers to share context
> > with anyone but themselves. Ok so far?
> >
> > The first question we run into is what does the iommu driver do if it
> > does not provide isolation? If it's a translation-only iommu, maybe it
> > has a single context and can manage to maintain it internally and should
> > opt-out of isolation groups. But we also have things like MSI isolation
> > where we have multiple contexts and the isolation strength is sufficient
> > for dma_ops, but not iommu_ops. I don't know if we want to make that
> > determination via a strength bitmask at the group level or simply a bool
> > that the iommu driver sets when it registers a group.
>
> .../...
>
> I think we are trying to solve too many problems at once.
>
> I don't believe we should have the normal dma ops path in the kernel be
> rewritten on top of some new iommu isolation layer or anything like that
> at this stage.

I headed down this path because David seems to be suggesting, well,
stating, that all this infrastructure is useful for providing the simplicity
and efficiency of groups to the core, and not just for this niche vfio
usage. "You don't get a choice about whether you use groups." The only
way I can rationalize that is to give it some purpose which is used by
something other than just vfio. If all of that is "some day", then why
don't we start with the group management in vfio as I have it today and
eventually push that down into the core?

> Proceed step by step or we'll never be finished. What that means imho
> is:
>
> - Existing dma/iommu arch layers mostly remain as-is. It has its own
> internal representation of devices, domains and possibly grouping
> information, and things like the buggy RICOH devices are handled there
> (via a flag set in a generic header quirk)

But we're already being harassed to rework iommu_ops to operate based on
groups and to take grouping into account for things like the ricoh
device in dma_ops. Mixed messages...

> - Isolation groups are a new high-level thing that is not used by the
> iommu/dma code for normal in-kernel operations at this stage. The
> platform iommu/dma code creates the isolation group objects and
> populates them, it doesn't initially -use- them.

The iommu_device_group() interface is a tiny extension to the IOMMU API
that basically lets vfio query the iommu driver for a group identifier
for a device. vfio then uses that information to create and manage
groups itself.
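
Just to be concrete about how small the consumer side is, vfio's use of it
amounts to something like the sketch below (I'm hand-waving the exact
prototype, and vfio_add_dev_to_group() is a made-up helper standing in for
the group bookkeeping):

static int vfio_device_add(struct device *dev)
{
	unsigned int groupid;
	int ret;

	/* ask the iommu driver which group this device belongs to */
	ret = iommu_device_group(dev, &groupid);
	if (ret)
		return ret;	/* no isolation info, don't expose the device */

	/* vfio creates or looks up its own group object keyed by groupid */
	return vfio_add_dev_to_group(dev, groupid);	/* made-up helper */
}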

This is simply reversing that, allowing the iommu driver to create
groups and add devices, which are centrally managed. That sounds like a
good long term goal, especially when we start making use of the groups
for dma/iommu ops, but it does require that we define an API both for
group providers (iommu drivers) and the group users (vfio). It's also
much more intrusive to the iommu drivers (contrast the trivial patches
that add iommu_device_group support to the intel and amd iommus) and
creates bloat for a niche user if we're not going to get around to making
groups more pervasive anytime soon.
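
To put some bounds on "more intrusive", the provider side in each iommu
driver would look something like this sketch (using the
isolation_group_new()/isolation_group_dev_add() names from the quoted text
above; the prototypes and the find-group step are invented):

static int my_iommu_add_device(struct device *dev)
{
	struct isolation_group *group;

	/*
	 * Figure out which existing group this device shares isolation
	 * with (requester ID/ACS topology on x86, PE on power) and
	 * create a new group if none matches.
	 */
	group = my_iommu_find_group(dev);	/* driver-specific, invented */
	if (!group) {
		group = isolation_group_new();	/* prototype guessed */
		if (!group)
			return -ENOMEM;
	}

	return isolation_group_dev_add(group, dev);
}

static int my_iommu_bus_notify(struct notifier_block *nb,
			       unsigned long action, void *data)
{
	struct device *dev = data;

	if (action == BUS_NOTIFY_ADD_DEVICE)
		my_iommu_add_device(dev);

	return NOTIFY_OK;
}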

> - We can then use that for the needs of exposing the groups to user
> space, perform driver level isolation, and provide the grouping
> information to vfio.
>
> - In the -long- run, individual iommu/dma layers might be modified to
> instead use the isolation layer as their own building block if that
> makes sense, this can be looked at on a case by case basis but is not a
> pre-req.
>
> That leaves us with what I think is the main disagreement between David
> and Alex approaches which is whether devices in a group are "detached"
> completely (in some kind of limbo) while they are used by vfio or
> something else vs. assigned to a different driver (vfio).
>
> After much thinking, I believe that the driver ownership model is the
> right one. However, we could still use the work David did here. IE. I
> _do_ think that exposing groups as the individual unit of ownership to
> user space is the right thing to do.

So far we're only exposing the group structure here (which vfio also
does in sysfs), not providing a userspace or driver interface for taking
ownership of the group. I'm still not clear how this ends up satisfying
David's complaints about the group ownership vs device access model in
vfio.

> That means that instead of that current detached flag alone, what we
> could do instead is have an in-kernel mechanism to request isolation
> which works by having the isolation group kernel code perform:
>
> - Rename the detached flag to "exclusive". This flag prevents automatic
> & user driven attachment of a driver by the generic code. Set that flag
> on all devices in the group (need a hook for hotplug).

This is a sysfs attribute on the group? So step #1:

$ echo 1 > /sys/isolation/$GROUP/exclusive
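
(Where I'm assuming "exclusive" means the generic bind path asks the group
before attaching any normal driver, i.e. something like this entirely
invented helper:

bool isolation_group_allows_bind(struct device *dev)
{
	struct isolation_group *group = dev_to_isolation_group(dev); /* invented */

	return !group || !group->exclusive;
}

with the driver core refusing the bind when it returns false.)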

> - Detach existing drivers.

step #2:

$ for i in /sys/isolation/$GROUP/devices/*; do \
    echo $(basename $i) > $i/driver/unbind; done
>
> - Once everything is detached, attach the requested client driver (vfio
> being the one we initially provide, maybe uio could use that framework
> as well). This can be done using an explicit in-kernel attach API that
> can be used even when the exclusive flag is set.

step #3: A miracle occurs ;)

I'm not sure what this looks like. Are we going to do an
isolation_group_register_meta_driver() that vfio calls with some defined
ops structure, so a user can:

$ echo "vfio" > /sys/isolation/$GROUP/driver

That triggers some walk of the group's device list, asking vfio to find
the right bus driver for each device and force-attach it? A model
something like that is the only way I can see that removing groups from
vfio doesn't also disassociate the group meta driver from the group
device drivers.
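
In other words, something vaguely like this (all names invented):

struct isolation_meta_driver_ops {
	const char *name;		/* "vfio" */
	int  (*bind_dev)(struct isolation_group *group, struct device *dev);
	void (*unbind_dev)(struct isolation_group *group, struct device *dev);
};

int isolation_group_register_meta_driver(struct isolation_meta_driver_ops *ops);

where writing "vfio" to /sys/isolation/$GROUP/driver walks the group's
device list calling ->bind_dev(), and vfio's ->bind_dev() has to chase down
the right bus driver (vfio-pci, etc.) and force-attach it.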

If instead, we make vfio be the group manager (since nobody else uses it
anyway), all of the above becomes:

$ for i in /sys/devices/virtual/vfio/pci:257/devices/*; do \
    echo $(basename $i) > $i/driver/unbind; \
    echo $(basename $i) > /sys/bus/pci/drivers/vfio/bind; \
  done

> - Win

I wish; we still have to work out how vfio then exposes the group to
userspace, and how we avoid ending up with the same individual device
manipulation vs. group access model that seems to be getting under
David's skin. Effectively what we've done above is:

* provide a group model that we'll "some day" make more pervasive
* make the binding of devices to group device drivers implicit in
some kind of meta driver attachment

Thanks,

Alex

