Re: RFC: vfio-pci API for PCI bus/slot (hot) resets

From: Benjamin Herrenschmidt
Date: Fri Aug 02 2013 - 17:31:31 EST


On Fri, 2013-08-02 at 10:36 -0600, Alex Williamson wrote:

> > Also in some cases, at least for us, we have a problem where the scope
> > of the reset can be larger than the group ... in this case I think we
> > need to forbid the reset.
>
> Interesting, I would have ventured to guess that resets are always
> contained within a single group given the PE isolation domains rooted at
> a PHB.

Depends how I configure them. For example, if I assign two functions to
two different PEs (SR-IOV typically), a hot reset will kill both.

For fundamental reset, depending on how the mobo is wired up and other
platform things, I might for example be limited on doing a PERST at the
PHB level, thus accross multiple PEs.

> Why disallow it though? What I propose below would still allow
> it if the user can prove that they own all of the affected groups.

Right, I mean disallow it if the PEs are not all owned by the same user.
What you propose would probably work.

> Even
> on x86, I think it's too restrictive to only allow reset if the affect
> is contained within a group. It's just too easy to need it for a
> multi-function device. We could hope that a device that implements ACS
> also implements some sort of FLR, but that may just be wishful thinking.

Right. Another use case is, I know of devices that need a fundamental
reset (PERST) after applying a FW update.

> > For us, it's also tied into our "EEH" error handling. IE. The way the
> > guest will request some of these things is going to be via firmware APIs
> > that we have yet to implement, but that we were also thinking of
> > implementing entire in the kernel rather than qemu for various
> > reasons... IE. There is more to error handling & recovery in our case at
> > least than AER and reset :-)
>
> I've been focused on reset for device re-initialization and
> repeatability, but that's a good point, we need to consider error
> recovery as a use of this interface as well.

Unfortunately, with EEH we have very specific FW interface (including
low level diagnostic data "blobs" etc...) that we need to implement so I
am not sure we can "fit" it in a generic interface.

We might want to attempt to list our interfaces and see. CC'ing Gavin
who is our EEH expert.

A rough cut is

- We have the concept of "frozen" PE (frozen MMIO and frozen DMA, two
separate attributes). So we can query and change the freeze state on a
PE.

- We have various level of error severity (from informational
(corrected) to full PHB need a reset with some variations in the middle
of PEs being frozen etc...)

- We have diag blobs coming out of FW for the guest to log (and
possibly partially decode).

- We have interfaces for various types of resets

> > I need to spend more time reading your proposal and see how it can work
> > for us (or not...)... but we might end up just doing our own thing that
> > carries the whole EEH API instead.
>
> Obviously it would be great if it could work for you, but regardless of
> whether you intend to use it I'd appreciate feedback. Thanks,

Ben.

> Alex
>
> > > ---
> > > Mechanism to do PCI hot resets through VFIO:
> > >
> > > VFIO is fundamentally an IOMMU group and device level interface.
> > > There's no concept of buses, slots, or hierarchies of devices. There
> > > are only IOMMU group and devices. A bus (or slot) may contain exactly
> > > one IOMMU group, multiple IOMMU groups, or a portion of an IOMMU group.
> > > An IOMMU group may contain one or more devices.
> > >
> > > The first question is perhaps where should we create the interface to do
> > > a PCI hot reset. Assuming an ioctl interface, our choices are the
> > > group, the container, or the device file descriptors. Groups and
> > > containers are not PCI specific, so an extension on either of those
> > > doesn't make much sense. They also don't have much granularity if your
> > > goal is to do a hot reset on the smallest subset of devices you can.
> > > Therefore the only choice seems to be a VFIO device level interface.
> > >
> > > The fact that a hot reset affects multiple devices also raises concerns.
> > > How do we make sure a user has sufficient access/privilege to perform
> > > this operation? If all of the affected devices are within the same
> > > group, then we know the user already "owns" all those devices. Using
> > > groups as the boundary excludes a number of use cases though. The user
> > > would need to prove that they also own the other groups or devices that
> > > are affected by the reset. This might be multiple groups, so the ioctl
> > > quickly grows to requiring a list of file descriptors be passed for
> > > validation.
> > >
> > > We already use the group file descriptor as a unit of ownership for
> > > enabling the container, so it seems like it would make sense to use it
> > > here too. The alternative is a device file descriptor, but groups may
> > > encompass devices the user doesn't care to use and we don't want to
> > > require that they open a file descriptor simply to perform a hot reset.
> > > Groups can also contain devices that the user cannot open, for instance
> > > those owned by VFIO "compatible" drivers like pci-stub or pcieport.
> > >
> > > The user also needs to know the set of devices affected by a hot reset,
> > > otherwise they have no idea which group file descriptors to pass to such
> > > an interface. That implies we also need a separate "info" ioctl for the
> > > user to learn that information. We could argue that the user could
> > > learn this information from sysfs, but that imposes non-trivial library
> > > or code overhead on the user to evaluate the topology. The PCI hot
> > > reset info ioctl would need to indicate whether a hot reset is
> > > available, and the set of affected devices. It may be useful to provide
> > > this as a {group, device} pair so the user doesn't need to
> > > cross-reference each device with sysfs to determine the group for the
> > > device. This would then provide both the set of groups required to
> > > perform the hot reset and the set of devices affected by the hot reset.
> > >
> > > As an alternative, we could consider simply requiring that all of the
> > > devices affected by a hot reset belong to the same VFIO container.
> > > However, allowing multiple groups per container is an optional IOMMU
> > > capability that really has no relation to PCI bus/slot boundaries. It
> > > seems a bit arbitrary to require groups be placed in the same container
> > > to get a PCI hot reset. That likely means we'd still need to support
> > > passing some kind of ownership token as above with groups. So it
> > > doesn't seem to make the situation any better.
> > >
> > > Given the above discussion, I therefore propose the following PCI hot
> > > reset interface:
> > >
> > > /**
> > > * VFIO_DEVICE_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + ??,
> > > * struct vfio_device_pci_hot_reset_info)
> > > */
> > >
> > > #define VFIO_DEVICE_PCI_HOT_RESET_INFO _IO(VFIO_TYPE, VFIO_BASE + ??)
> > >
> > > struct vfio_device_pci_hot_reset_info_entry {
> > > __u32 group_id;
> > > __u16 segment; /* A reset will never include devices on other segments... return it anyway */
> > > __u8 bus;
> > > __u8 devfn; /* Use PCI_SLOT/PCI_FUNC */
> > > };
> > >
> > > struct vfio_device_pci_hot_reset_info {
> > > __u32 argsz;
> > > __u32 flags;
> > > #define VFIO_PCI_HOT_RESET_SUPPORTED (1 << 0) /* Device supports hot reset */
> > > #define VFIO_PCI_HOT_RESET_POPULATED (1 << 1) /* Entries field are populated */
> > > __u32 count;
> > > struct vfio_device_pci_hot_reset_info_entry entries[];
> > > };
> > >
> > > The user calls VFIO_DEVICE_PCI_HOT_RESET_INFO on a VFIO device file
> > > descriptor with a struct vfio_device_pci_hot_reset_info data structure,
> > > minimally the sizeof the struct with argsz set. VFIO returns whether
> > > hot reset is supported and the number of devices affected by the reset.
> > > If argsz is big enough, VFIO will fill in the entries and set the
> > > populated flag, otherwise the caller can reallocate the structure and
> > > try again.
> > >
> > > /**
> > > * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + ??,
> > > * struct vfio_device_pci_hot_reset)
> > > */
> > >
> > > #define VFIO_DEVICE_PCI_HOT_RESET _IO(VFIO_TYPE, VFIO_BASE + ??)
> > >
> > > struct vfio_device_pci_hot_reset {
> > > __u32 argsz;
> > > __u32 flags;
> > > __u32 count;
> > > __u32 fds[];
> > > };
> > >
> > > As above, the user calls VFIO_DEVICE_PCI_HOT_RESET on a VFIO device file
> > > descriptor. Using the list from PCI_HOT_RESET_INFO, the user allocates
> > > a struct vfio_device_pci_hot_reset of sufficient size to pass a list of
> > > VFIO group file descriptors. There should be one file descriptor for
> > > each group listed in the info entries. If the list of groups matches
> > > those affected by a hot reset of the device, then VFIO will perform the
> > > hot reset action and return success.
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
> >
>
>


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/