Re: [RFC 3/3] vfio/pci: use runtime PM for vfio-device into low power state
From: Alex Williamson
Date: Wed Nov 17 2021 - 12:53:36 EST
On Mon, 15 Nov 2021 19:06:40 +0530
<abhsahu@xxxxxxxxxx> wrote:
> From: Abhishek Sahu <abhsahu@xxxxxxxxxx>
>
> Currently, if the runtime power management is enabled for vfio-pci
> device in guest OS, then guest OS will do the register write for
> PCI_PM_CTRL register. This write request will be handled in
> vfio_pm_config_write() where it will do the actual register write of
> PCI_PM_CTRL register. With this, the maximum D3hot state can be
> achieved for low power. If we can use the runtime PM framework, then
> we can achieve the D3cold state which will help in saving
> maximum power.
>
> This patch uses runtime PM framework whenever vfio-pci device will
> be put in the low power state.
>
> 1. If runtime PM is enabled, then instead of directly writing
> PCI_PM_CTRL register, decrement the device usage counter whenever
> the power state is non-D0. The kernel runtime PM framework will
> itself put the PCI device in low power state when device usage
> counter will become zero. Similarly, when the power state will be
> again changed back to D0, then increment the device usage counter
> and the kernel runtime PM framework will itself bring the PCI device
> out of low power state.
>
> 2. The guest OS will read the PCI_PM_CTRL register back to
> confirm the current power state so virtual register bits can be
> used. For this, before decrementing the usage count, read the actual
> PCI_PM_CTRL register, update the power state related bits, and then
> update the vconfig bits corresponding to PCI_PM_CTRL offset. For
> PCI_PM_CTRL register read, return the virtual value if runtime PM is
> requested. This vconfig bits will be cleared when the power state
> will be changed back to D0.
>
> 3. For the guest OS, the PCI power state will be still D3hot
> even if put the actual PCI device into D3cold state. In the D3hot
> state, the config space can still be read. So, if there is any request
> from guest OS to read the config space, then we need to move the actual
> PCI device again back to D0. For this, increment the device usage
> count before reading/writing the config space and then decrement it
> again after reading/writing the config space. There can be
> back-to-back config register read/write request, but since the auto
> suspend methods are being used here so only first access will
> wake up the PCI device and for the remaining access, the device will
> already be active.
>
> 4. This above-mentioned wake up is not needed if the register
> read/write is done completely with virtual bits. For handling this
> condition, the actual resume of device will only being done in
> vfio_user_config_read()/vfio_user_config_write(). All the config
> register access will come vfio_pci_config_rw(). So, in this
> function, use the pm_runtime_get_noresume() so that only usage count
> will be incremented without resuming the device. Inside,
> vfio_user_config_read()/vfio_user_config_write(), use the routines
> with pm_runtime_put_noidle() so that the PCI device won’t be
> suspended in the lower level functions. Again in the top level
> vfio_pci_config_rw(), use the pm_runtime_put_autosuspend(). Now the
> auto suspend timer will be started and the device will be suspended
> again. If the device is already runtime suspended, then this routine
> will return early.
>
> 5. In the host side D3cold will only be used if the platform has
> support for this, otherwise some other state will be used. The
> config space can be read if the device is not in D3cold state. So in
> this case, we can skip the resuming of PCI device. The wrapper
> function vfio_pci_config_pm_runtime_get() takes care of this
> condition and invoke the pm_runtime_resume() only if current power
> state is D3cold.
>
> 6. For vfio_pci_config_pm_runtime_get()/vfio_
> pci_config_pm_runtime_put(), the reference code is taken from
> pci_config_pm_runtime_get()/pci_config_pm_runtime_put() and then it
> is modified according to vfio-pci driver requirement.
>
> 7. vfio_pci_set_power_state() will be unused after moving to runtime
> PM, so this function can be removed along with other related
> functions and structure fields.
If we're transitioning a device to D3cold rather than D3hot as
requested by userspace, isn't that a user visible change? For
instance, a device may report NoSoftRst- indicating that the device
does not do a soft reset on D3hot->D0 transition. If we're instead
putting the device in D3cold, then a transition back to D0 has very
much undergone a reset. On one hand we should at least virtualize the
NoSoftRst bit to allow the guest to restore the device, but I wonder if
that's really safe. Is a better option to prevent entering D3cold if
the device isn't natively reporting NoSoftRst-?
We're also essentially making a policy decision on behalf of userspace
that favors power saving over latency. Is that universally the correct
trade-off? I can imagine this could be desirable for many use cases,
but if we're going to covertly replace D3hot with D3cold, it seems like
there should be an opt-in. Is co-opting the PM capability for this
even really acceptable or should there be a device ioctl to request
D3cold and plumbing through QEMU such that a VM guest can make informed
choices regarding device power management?
Also if the device is not responsive to config space due to the user
placing it in D3 now, I'd expect there are other ioctl paths that need
to be blocked, maybe even MMIO paths that might be a gap for existing
D3hot support. Thanks,
Alex