[PATCH v4 4/6] vfio/pci: Add the support for PCI D3cold state

From: Abhishek Sahu
Date: Fri Jul 01 2022 - 07:09:00 EST


Currently, if the runtime power management is enabled for vfio-pci
based devices in the guest OS, then the guest OS will do the register
write for PCI_PM_CTRL register. This write request will be handled in
vfio_pm_config_write() where it will do the actual register write of
PCI_PM_CTRL register. With this, the maximum D3hot state can be
achieved for low power. If we can use the runtime PM framework, then
we can achieve the D3cold state (on the supported systems) which will
help in saving maximum power.

1. D3cold state can't be achieved by writing PCI standard
PM config registers. This patch implements the newly added
'VFIO_DEVICE_FEATURE_POWER_MANAGEMENT' device feature which
can be used for putting the device into the D3cold state.

2. The hypervisors can implement virtual ACPI methods. For example,
in guest linux OS if PCI device ACPI node has _PR3 and _PR0 power
resources with _ON/_OFF method, then guest linux OS invokes
the _OFF method during D3cold transition and then _ON during D0
transition. The hypervisor can tap these virtual ACPI calls and then
call the 'VFIO_DEVICE_FEATURE_POWER_MANAGEMENT' with respective flags.

3. The vfio-pci driver uses runtime PM framework to achieve the
D3cold state. For the D3cold transition, decrement the usage count and
for the D0 transition, increment the usage count.

4. If the D3cold state is not supported, then the device will
still be in the D3hot state. But with the runtime PM, the root port
can now also go into the suspended state.

5. The 'pm_runtime_engaged' flag tracks the entry and exit to
runtime PM. This flag is protected with 'memory_lock' semaphore.

6. During exit time, the flag clearing and usage count increment
are protected with 'memory_lock'. The actual wake-up is happening
outside 'memory_lock' since 'memory_lock' will be needed inside
runtime_resume callback also in subsequent patches.

7. In D3cold, all kinds of device-related access (BAR read/write,
config read/write, etc.) need to be disabled. For BAR-related access,
we can use existing D3hot memory disable support. During the low power
entry, invalidate the mmap mappings and add the check for
'pm_runtime_engaged' flag.

8. For config space, ideally, we need to return an error whenever
there is any config access from the user side once the user moved the
device into low power state. But adding a check for
'pm_runtime_engaged' flag alone won't be sufficient due to the
following possible scenario from the user side where config space
access happens parallelly with the low power entry IOCTL.

a. Config space access happens and vfio_pci_config_rw() will be
called.
b. The IOCTL to move into low power state is called.
c. The IOCTL will move the device into d3cold.
d. Exit from vfio_pci_config_rw() happened.

Now, if we just check 'pm_runtime_engaged', then in the above
sequence the config space access will happen when the device already
is in the low power state. To prevent this situation, we increment the
usage count before any config space access and decrement the same
after completing the access. Also, to prevent any similar cases for
other types of access, the usage count will be incremented for all
kinds of access.

Signed-off-by: Abhishek Sahu <abhsahu@xxxxxxxxxx>
---
drivers/vfio/pci/vfio_pci_config.c | 2 +-
drivers/vfio/pci/vfio_pci_core.c | 169 +++++++++++++++++++++++++++--
include/linux/vfio_pci_core.h | 1 +
3 files changed, 164 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 9343f597182d..21a4743d011f 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -408,7 +408,7 @@ bool __vfio_pci_memory_enabled(struct vfio_pci_core_device *vdev)
* PF SR-IOV capability, there's therefore no need to trigger
* faults based on the virtual value.
*/
- return pdev->current_state < PCI_D3hot &&
+ return !vdev->pm_runtime_engaged && pdev->current_state < PCI_D3hot &&
(pdev->no_command_memory || (cmd & PCI_COMMAND_MEMORY));
}

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 5948d930449b..8c17ca41d156 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -264,6 +264,18 @@ static int vfio_pci_core_runtime_suspend(struct device *dev)
{
struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);

+ down_write(&vdev->memory_lock);
+ /*
+ * The user can move the device into D3hot state before invoking
+ * power management IOCTL. Move the device into D0 state here and then
+ * the pci-driver core runtime PM suspend function will move the device
+ * into the low power state. Also, for the devices which have
+ * NoSoftRst-, it will help in restoring the original state
+ * (saved locally in 'vdev->pm_save').
+ */
+ vfio_pci_set_power_state(vdev, PCI_D0);
+ up_write(&vdev->memory_lock);
+
/*
* If INTx is enabled, then mask INTx before going into the runtime
* suspended state and unmask the same in the runtime resume.
@@ -386,6 +398,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
struct pci_dev *pdev = vdev->pdev;
struct vfio_pci_dummy_resource *dummy_res, *tmp;
struct vfio_pci_ioeventfd *ioeventfd, *ioeventfd_tmp;
+ bool do_resume = false;
int i, bar;

/* For needs_reset */
@@ -393,6 +406,25 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)

/*
* This function can be invoked while the power state is non-D0.
+ * This non-D0 power state can be with or without runtime PM.
+ * Increment the usage count corresponding to pm_runtime_put()
+ * called during setting of 'pm_runtime_engaged'. The device will
+ * wake up if it has already gone into the suspended state.
+ * Otherwise, the next vfio_pci_set_power_state() will change the
+ * device power state to D0.
+ */
+ down_write(&vdev->memory_lock);
+ if (vdev->pm_runtime_engaged) {
+ vdev->pm_runtime_engaged = false;
+ pm_runtime_get_noresume(&pdev->dev);
+ do_resume = true;
+ }
+ up_write(&vdev->memory_lock);
+
+ if (do_resume)
+ pm_runtime_resume(&pdev->dev);
+
+ /*
* This function calls __pci_reset_function_locked() which internally
* can use pci_pm_reset() for the function reset. pci_pm_reset() will
* fail if the power state is non-D0. Also, for the devices which
@@ -1190,6 +1222,99 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
}
EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);

+static int vfio_pci_pm_validate_flags(u32 flags)
+{
+ if (!flags)
+ return -EINVAL;
+ /* Only valid flags should be set */
+ if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
+ return -EINVAL;
+ /* Both enter and exit should not be set */
+ if ((flags & (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT)) ==
+ (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
+ return -EINVAL;
+
+ return 0;
+}
+
+static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
+ void __user *arg, size_t argsz)
+{
+ struct vfio_pci_core_device *vdev =
+ container_of(device, struct vfio_pci_core_device, vdev);
+ struct pci_dev *pdev = vdev->pdev;
+ struct vfio_device_feature_power_management vfio_pm = { 0 };
+ int ret = 0;
+
+ ret = vfio_check_feature(flags, argsz,
+ VFIO_DEVICE_FEATURE_SET |
+ VFIO_DEVICE_FEATURE_GET,
+ sizeof(vfio_pm));
+ if (ret != 1)
+ return ret;
+
+ if (flags & VFIO_DEVICE_FEATURE_GET) {
+ down_read(&vdev->memory_lock);
+ if (vdev->pm_runtime_engaged)
+ vfio_pm.flags = VFIO_PM_LOW_POWER_ENTER;
+ else
+ vfio_pm.flags = VFIO_PM_LOW_POWER_EXIT;
+ up_read(&vdev->memory_lock);
+
+ if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm)))
+ return -EFAULT;
+
+ return 0;
+ }
+
+ if (copy_from_user(&vfio_pm, arg, sizeof(vfio_pm)))
+ return -EFAULT;
+
+ ret = vfio_pci_pm_validate_flags(vfio_pm.flags);
+ if (ret)
+ return ret;
+
+ /*
+ * The vdev power related flags are protected with 'memory_lock'
+ * semaphore.
+ */
+ if (vfio_pm.flags & VFIO_PM_LOW_POWER_ENTER) {
+ vfio_pci_zap_and_down_write_memory_lock(vdev);
+ if (vdev->pm_runtime_engaged) {
+ up_write(&vdev->memory_lock);
+ return -EINVAL;
+ }
+
+ vdev->pm_runtime_engaged = true;
+ up_write(&vdev->memory_lock);
+ pm_runtime_put(&pdev->dev);
+ } else if (vfio_pm.flags & VFIO_PM_LOW_POWER_EXIT) {
+ down_write(&vdev->memory_lock);
+ if (!vdev->pm_runtime_engaged) {
+ up_write(&vdev->memory_lock);
+ return -EINVAL;
+ }
+
+ vdev->pm_runtime_engaged = false;
+ pm_runtime_get_noresume(&pdev->dev);
+ up_write(&vdev->memory_lock);
+ ret = pm_runtime_resume(&pdev->dev);
+ if (ret < 0) {
+ down_write(&vdev->memory_lock);
+ if (!vdev->pm_runtime_engaged) {
+ vdev->pm_runtime_engaged = true;
+ pm_runtime_put_noidle(&pdev->dev);
+ }
+ up_write(&vdev->memory_lock);
+ return ret;
+ }
+ } else {
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
void __user *arg, size_t argsz)
{
@@ -1224,6 +1349,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
switch (flags & VFIO_DEVICE_FEATURE_MASK) {
case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
return vfio_pci_core_feature_token(device, flags, arg, argsz);
+ case VFIO_DEVICE_FEATURE_POWER_MANAGEMENT:
+ return vfio_pci_core_feature_pm(device, flags, arg, argsz);
default:
return -ENOTTY;
}
@@ -1234,31 +1361,47 @@ static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
size_t count, loff_t *ppos, bool iswrite)
{
unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+ int ret;

if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
return -EINVAL;

+ ret = pm_runtime_resume_and_get(&vdev->pdev->dev);
+ if (ret < 0) {
+ pci_info_ratelimited(vdev->pdev, "runtime resume failed %d\n",
+ ret);
+ return -EIO;
+ }
+
switch (index) {
case VFIO_PCI_CONFIG_REGION_INDEX:
- return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
+ ret = vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
+ break;

case VFIO_PCI_ROM_REGION_INDEX:
if (iswrite)
- return -EINVAL;
- return vfio_pci_bar_rw(vdev, buf, count, ppos, false);
+ ret = -EINVAL;
+ else
+ ret = vfio_pci_bar_rw(vdev, buf, count, ppos, false);
+ break;

case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
- return vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
+ ret = vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
+ break;

case VFIO_PCI_VGA_REGION_INDEX:
- return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
+ ret = vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
+ break;
+
default:
index -= VFIO_PCI_NUM_REGIONS;
- return vdev->region[index].ops->rw(vdev, buf,
+ ret = vdev->region[index].ops->rw(vdev, buf,
count, ppos, iswrite);
+ break;
}

- return -EINVAL;
+ pm_runtime_put(&vdev->pdev->dev);
+ return ret;
}

ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
@@ -2157,6 +2300,15 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
goto err_unlock;
}

+ /*
+ * Some of the devices in the dev_set can be in the runtime suspended
+ * state. Increment the usage count for all the devices in the dev_set
+ * before reset and decrement the same after reset.
+ */
+ ret = vfio_pci_dev_set_pm_runtime_get(dev_set);
+ if (ret)
+ goto err_unlock;
+
list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
/*
* Test whether all the affected devices are contained by the
@@ -2212,6 +2364,9 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
else
mutex_unlock(&cur->vma_lock);
}
+
+ list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
+ pm_runtime_put(&cur->pdev->dev);
err_unlock:
mutex_unlock(&dev_set->lock);
return ret;
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index cdfd328ba6b1..bf4823b008f9 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -125,6 +125,7 @@ struct vfio_pci_core_device {
bool nointx;
bool needs_pm_restore;
bool pm_intx_masked;
+ bool pm_runtime_engaged;
struct pci_saved_state *pci_saved_state;
struct pci_saved_state *pm_save;
int ioeventfds_nr;
--
2.17.1