Re: [PATCH v4 6/6] vfio/pci: Add support for virtual PME

From: Alex Williamson
Date: Wed Jul 06 2022 - 11:48:23 EST


On Fri, 1 Jul 2022 16:38:14 +0530
Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:

> If the PCI device is in low power state and the device requires
> wake-up, then it can generate PME (Power Management Events). Mostly
> these PME events will be propagated to the root port and then the
> root port will generate the system interrupt. Then the OS should
> identify the device which generated the PME and should resume
> the device.
>
> We can implement a similar virtual PME framework where if the device
> already went into the runtime suspended state and then there is any
> wake-up on the host side, then it will send the virtual PME
> notification to the guest. This virtual PME will be helpful for the cases
> where the device will not be suspended again if there is any wake-up
> triggered by the host. Following is the overall approach regarding
> the virtual PME.
>
> 1. Add one more event like VFIO_PCI_ERR_IRQ_INDEX named
> VFIO_PCI_PME_IRQ_INDEX and do the required code changes to get/set
> this new IRQ.
>
> 2. From the guest side, the guest needs to enable eventfd for the
> virtual PME notification.
>
> 3. In the vfio-pci driver, the PME support bits are currently
> virtualized and set to 0. We can set PME capability support for all
> the power states. This PME capability support is independent of the
> physical PME support.
>
> 4. The PME enable (PME_En bit in Power Management Control/Status
> Register) and PME status (PME_Status bit in Power Management
> Control/Status Register) are also virtualized currently.
> The write support for PME_En bit can be enabled.
>
> 5. The PME_Status bit is a write-1-clear bit where the write with
> zero value will have no effect and write with 1 value will clear the
> bit. The write for this bit will be trapped inside
> vfio_pm_config_write() similar to PCI_PM_CTRL write for PM_STATES.
>
> 6. When the host gets a request for resuming the device other than from
> low power exit feature IOCTL, then PME_Status bit will be set.
> According to [PCIe v5 7.5.2.2],
> "PME_Status - This bit is Set when the Function would normally
> generate a PME signal. The value of this bit is not affected by
> the value of the PME_En bit."
>
> So even if PME_En bit is not set, we can set PME_Status bit.
>
> 7. If the guest has enabled PME_En and registered for PME events
> through eventfd, then the usage count will be incremented to prevent
> the device to go into the suspended state and notify the guest through
> eventfd trigger.
>
> The virtual PME can help in handling physical PME also. When
> physical PME comes, then also the runtime resume will be called. If
> the guest has registered for virtual PME, then it will be sent in this
> case also.
>
> * Implementation for handling the virtual PME on the hypervisor:
>
> If we take the implementation in Linux OS, then during runtime suspend
> time, then it calls __pci_enable_wake(). It internally enables PME
> through pci_pme_active() and also enables the ACPI side wake-up
> through platform_pci_set_wakeup(). To handle the PME, the hypervisor has
> the following two options:
>
> 1. Create a virtual root port for the VFIO device and trigger
> interrupt when the PME comes. It will call pcie_pme_irq() which will
> resume the device.
>
> 2. Create a virtual ACPI _PRW resource and associate it with the device
> itself. In _PRW, any GPE (General Purpose Event) can be assigned for
> the wake-up. When PME comes, then GPE can be triggered by the
> hypervisor. GPE interrupt will call pci_acpi_wake_dev() function
> internally and it will resume the device.

Do we really need to implement PME emulation in the kernel or is it
sufficient for userspace to simply register a one-shot eventfd when
SET'ing the low power feature and QEMU can provide the PME emulation
based on that signaling? Thanks,

Alex

>
> Signed-off-by: Abhishek Sahu <abhsahu@xxxxxxxxxx>
> ---
> drivers/vfio/pci/vfio_pci_config.c | 39 +++++++++++++++++++++------
> drivers/vfio/pci/vfio_pci_core.c | 43 ++++++++++++++++++++++++------
> drivers/vfio/pci/vfio_pci_intrs.c | 18 +++++++++++++
> include/linux/vfio_pci_core.h | 2 ++
> include/uapi/linux/vfio.h | 1 +
> 5 files changed, 87 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index 21a4743d011f..a06375a03758 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -719,6 +719,20 @@ static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos,
> if (count < 0)
> return count;
>
> + /*
> + * PME_STATUS is write-1-clear bit. If PME_STATUS is 1, then clear the
> + * bit in vconfig. The PME_STATUS is in the upper byte of the control
> + * register and user can do single byte write also.
> + */
> + if (offset <= PCI_PM_CTRL + 1 && offset + count > PCI_PM_CTRL + 1) {
> + if (le32_to_cpu(val) &
> + (PCI_PM_CTRL_PME_STATUS >> (offset - PCI_PM_CTRL) * 8)) {
> + __le16 *ctrl = (__le16 *)&vdev->vconfig
> + [vdev->pm_cap_offset + PCI_PM_CTRL];
> + *ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_STATUS);
> + }
> + }
> +
> if (offset == PCI_PM_CTRL) {
> pci_power_t state;
>
> @@ -771,14 +785,16 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
> * the user change power state, but we trap and initiate the
> * change ourselves, so the state bits are read-only.
> *
> - * The guest can't process PME from D3cold so virtualize PME_Status
> - * and PME_En bits. The vconfig bits will be cleared during device
> - * capability initialization.
> + * The guest can't process physical PME from D3cold so virtualize
> + * PME_Status and PME_En bits. These bits will be used for the
> + * virtual PME between host and guest. The vconfig bits will be
> + * updated during device capability initialization. PME_Status is
> + * write-1-clear bit, so it is read-only. We trap and update the
> + * vconfig bit manually during write.
> */
> p_setd(perm, PCI_PM_CTRL,
> PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS,
> - ~(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS |
> - PCI_PM_CTRL_STATE_MASK));
> + ~(PCI_PM_CTRL_STATE_MASK | PCI_PM_CTRL_PME_STATUS));
>
> return 0;
> }
> @@ -1454,8 +1470,13 @@ static void vfio_update_pm_vconfig_bytes(struct vfio_pci_core_device *vdev,
> __le16 *pmc = (__le16 *)&vdev->vconfig[offset + PCI_PM_PMC];
> __le16 *ctrl = (__le16 *)&vdev->vconfig[offset + PCI_PM_CTRL];
>
> - /* Clear vconfig PME_Support, PME_Status, and PME_En bits */
> - *pmc &= ~cpu_to_le16(PCI_PM_CAP_PME_MASK);
> + /*
> + * Set the vconfig PME_Support bits. The PME_Status is being used for
> + * virtual PME support and is not dependent upon the physical
> + * PME support.
> + */
> + *pmc |= cpu_to_le16(PCI_PM_CAP_PME_MASK);
> + /* Clear vconfig PME_Support and PME_En bits */
> *ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS);
> }
>
> @@ -1582,8 +1603,10 @@ static int vfio_cap_init(struct vfio_pci_core_device *vdev)
> if (ret)
> return ret;
>
> - if (cap == PCI_CAP_ID_PM)
> + if (cap == PCI_CAP_ID_PM) {
> + vdev->pm_cap_offset = pos;
> vfio_update_pm_vconfig_bytes(vdev, pos);
> + }
>
> prev = &vdev->vconfig[pos + PCI_CAP_LIST_NEXT];
> pos = next;
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 1ddaaa6ccef5..6c1225bc2aeb 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -319,14 +319,35 @@ static int vfio_pci_core_runtime_resume(struct device *dev)
> * the low power state or closed the device.
> * - If there is device access on the host side.
> *
> - * For the second case, check if re-entry to the low power state is
> - * allowed. If not, then increment the usage count so that runtime PM
> - * framework won't suspend the device and set the 'pm_runtime_resumed'
> - * flag.
> + * For the second case:
> + * - The virtual PME_STATUS bit will be set. If PME_ENABLE bit is set
> + * and user has registered for virtual PME events, then send the PME
> + * virtual PME event.
> + * - Check if re-entry to the low power state is not allowed.
> + *
> + * For the above conditions, increment the usage count so that
> + * runtime PM framework won't suspend the device and set the
> + * 'pm_runtime_resumed' flag.
> */
> - if (vdev->pm_runtime_engaged && !vdev->pm_runtime_reentry_allowed) {
> - pm_runtime_get_noresume(dev);
> - vdev->pm_runtime_resumed = true;
> + if (vdev->pm_runtime_engaged) {
> + bool pme_triggered = false;
> + __le16 *ctrl = (__le16 *)&vdev->vconfig
> + [vdev->pm_cap_offset + PCI_PM_CTRL];
> +
> + *ctrl |= cpu_to_le16(PCI_PM_CTRL_PME_STATUS);
> + if (le16_to_cpu(*ctrl) & PCI_PM_CTRL_PME_ENABLE) {
> + mutex_lock(&vdev->igate);
> + if (vdev->pme_trigger) {
> + pme_triggered = true;
> + eventfd_signal(vdev->pme_trigger, 1);
> + }
> + mutex_unlock(&vdev->igate);
> + }
> +
> + if (!vdev->pm_runtime_reentry_allowed || pme_triggered) {
> + pm_runtime_get_noresume(dev);
> + vdev->pm_runtime_resumed = true;
> + }
> }
> up_write(&vdev->memory_lock);
>
> @@ -586,6 +607,10 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
> eventfd_ctx_put(vdev->req_trigger);
> vdev->req_trigger = NULL;
> }
> + if (vdev->pme_trigger) {
> + eventfd_ctx_put(vdev->pme_trigger);
> + vdev->pme_trigger = NULL;
> + }
> mutex_unlock(&vdev->igate);
> }
> EXPORT_SYMBOL_GPL(vfio_pci_core_close_device);
> @@ -639,7 +664,8 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
> } else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
> if (pci_is_pcie(vdev->pdev))
> return 1;
> - } else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
> + } else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX ||
> + irq_type == VFIO_PCI_PME_IRQ_INDEX) {
> return 1;
> }
>
> @@ -985,6 +1011,7 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
> switch (info.index) {
> case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
> case VFIO_PCI_REQ_IRQ_INDEX:
> + case VFIO_PCI_PME_IRQ_INDEX:
> break;
> case VFIO_PCI_ERR_IRQ_INDEX:
> if (pci_is_pcie(vdev->pdev))
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> index 1a37db99df48..db4180687a74 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -639,6 +639,17 @@ static int vfio_pci_set_req_trigger(struct vfio_pci_core_device *vdev,
> count, flags, data);
> }
>
> +static int vfio_pci_set_pme_trigger(struct vfio_pci_core_device *vdev,
> + unsigned index, unsigned start,
> + unsigned count, uint32_t flags, void *data)
> +{
> + if (index != VFIO_PCI_PME_IRQ_INDEX || start != 0 || count > 1)
> + return -EINVAL;
> +
> + return vfio_pci_set_ctx_trigger_single(&vdev->pme_trigger,
> + count, flags, data);
> +}
> +
> int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
> unsigned index, unsigned start, unsigned count,
> void *data)
> @@ -688,6 +699,13 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
> break;
> }
> break;
> + case VFIO_PCI_PME_IRQ_INDEX:
> + switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
> + case VFIO_IRQ_SET_ACTION_TRIGGER:
> + func = vfio_pci_set_pme_trigger;
> + break;
> + }
> + break;
> }
>
> if (!func)
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index 18cc83b767b8..ee2646d820c2 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -102,6 +102,7 @@ struct vfio_pci_core_device {
> bool bar_mmap_supported[PCI_STD_NUM_BARS];
> u8 *pci_config_map;
> u8 *vconfig;
> + u8 pm_cap_offset;
> struct perm_bits *msi_perm;
> spinlock_t irqlock;
> struct mutex igate;
> @@ -133,6 +134,7 @@ struct vfio_pci_core_device {
> int ioeventfds_nr;
> struct eventfd_ctx *err_trigger;
> struct eventfd_ctx *req_trigger;
> + struct eventfd_ctx *pme_trigger;
> struct list_head dummy_resources_list;
> struct mutex ioeventfds_lock;
> struct list_head ioeventfds_list;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 7e00de5c21ea..08170950d655 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -621,6 +621,7 @@ enum {
> VFIO_PCI_MSIX_IRQ_INDEX,
> VFIO_PCI_ERR_IRQ_INDEX,
> VFIO_PCI_REQ_IRQ_INDEX,
> + VFIO_PCI_PME_IRQ_INDEX,
> VFIO_PCI_NUM_IRQS
> };
>