Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state

From: Abhishek Sahu
Date: Mon May 30 2022 - 07:16:27 EST


On 5/10/2022 6:56 PM, Abhishek Sahu wrote:
> On 5/10/2022 3:18 AM, Alex Williamson wrote:
>> On Thu, 5 May 2022 17:46:20 +0530
>> Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:
>>
>>> On 5/5/2022 1:15 AM, Alex Williamson wrote:
>>>> On Mon, 25 Apr 2022 14:56:15 +0530
>>>> Abhishek Sahu <abhsahu@xxxxxxxxxx> wrote:
>>>>

<snip>

>>>>> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
>>>>> index af0ae80ef324..65b1bc9586ab 100644
>>>>> --- a/drivers/vfio/pci/vfio_pci_config.c
>>>>> +++ b/drivers/vfio/pci/vfio_pci_config.c
>>>>> @@ -25,6 +25,7 @@
>>>>> #include <linux/uaccess.h>
>>>>> #include <linux/vfio.h>
>>>>> #include <linux/slab.h>
>>>>> +#include <linux/pm_runtime.h>
>>>>>
>>>>> #include <linux/vfio_pci_core.h>
>>>>>
>>>>> @@ -1936,16 +1937,23 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
>>>>> ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>>>>> size_t count, loff_t *ppos, bool iswrite)
>>>>> {
>>>>> + struct device *dev = &vdev->pdev->dev;
>>>>> size_t done = 0;
>>>>> int ret = 0;
>>>>> loff_t pos = *ppos;
>>>>>
>>>>> pos &= VFIO_PCI_OFFSET_MASK;
>>>>>
>>>>> + ret = pm_runtime_resume_and_get(dev);
>>>>> + if (ret < 0)
>>>>> + return ret;
>>>>
>>>> Alternatively we could just check platform_pm_engaged here and return
>>>> -EINVAL, right? Why is waking the device the better option?
>>>>
>>>
>>> This is mainly to prevent a race condition where a config space access
>>> happens in parallel with an IOCTL access. So, let's consider the following case.
>>>
>>> 1. Config space access happens and vfio_pci_config_rw() will be called.
>>> 2. The IOCTL to move into low power state is called.
>>> 3. The IOCTL will move the device into d3cold.
>>> 4. Exit from vfio_pci_config_rw() happened.
>>>
>>> Now, if we just check platform_pm_engaged, then in the above
>>> sequence it won’t work. I checked this parallel access by writing
>>> a small program where I opened the 2 instances and then
>>> created 2 threads for config space and IOCTL.
>>> In my case, I got the above sequence.
>>>
>>> The pm_runtime_resume_and_get() call makes sure that the device
>>> usage count stays incremented throughout the config space
>>> access (or IOCTL access in the previous patch), so the
>>> runtime PM framework will not move the device into the suspended
>>> state.
>>
>> I think we're inventing problems here. If we define that config space
>> is not accessible while the device is in low power and the only way to
>> get the device out of low power is via ioctl, then we should be denying
>> access to the device while in low power. If the user races exiting the
>> device from low power and a config space access, that's their problem.
>>
>
> But what about a malicious user who intentionally tries to create
> this sequence? If the platform_pm_engaged check passes and the
> user then puts the device into the low power state, there is a
> chance that the config read happens while the device is in the
> low power state.
>

Hi Alex,

I need help reaching a conclusion on the part below so that I can
proceed with my implementation.

> Can we prevent this concurrent access somehow or make sure
> that nothing else is running when the low power ioctl runs?
>

Suppose I add only the 'platform_pm_engaged' check and return early:

vfio_pci_config_rw()
{
	...
	down_read(&vdev->memory_lock);
	if (vdev->platform_pm_engaged) {
		up_read(&vdev->memory_lock);
		return -EIO;
	}
	...
}

Then, if two user space threads are running, 'platform_pm_engaged' can
be false when we do the check but become true before we return from
this function. If the runtime PM framework puts the device into the
D3cold state in that window, the config read/write can hit the device
while it is in D3cold. I added prints locally at the entry and exit of
this function. When I create 2 threads from user space,
'platform_pm_engaged' is false at entry and true at exit. This is
similar to the issue of accessing memory while memory is disabled.

So, we need to make sure that the VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
ioctl request is exclusive and that no other config access or ioctl
request runs in parallel with it.

Could you or someone else please suggest a way to handle this case?

From my side, I have the following solution to handle this, but I am not
sure if it will be acceptable and work for all cases.

1. In a real use case, a config access or any other ioctl should not come
along with the VFIO_DEVICE_FEATURE_POWER_MANAGEMENT ioctl request.

2. Maintain an 'access_count' which will be incremented whenever we
do any config space access or ioctl.

3. At the beginning of a config space access or ioctl, we can do
something like this:

down_read(&vdev->memory_lock);
atomic_inc(&vdev->access_count);
if (vdev->platform_pm_engaged) {
	atomic_dec(&vdev->access_count);
	up_read(&vdev->memory_lock);
	return -EIO;
}
up_read(&vdev->memory_lock);

And before returning, we can decrement the 'access_count'.

down_read(&vdev->memory_lock);
atomic_dec(&vdev->access_count);
up_read(&vdev->memory_lock);

The atomic_dec() is put under 'memory_lock' to maintain
lock ordering rules for architectures where atomic_t is internally
implemented using locks.

4. Inside vfio_pci_core_feature_pm(), we can do something like this:

down_write(&vdev->memory_lock);
if (atomic_read(&vdev->access_count) != 1) {
	up_write(&vdev->memory_lock);
	return -EBUSY;
}
vdev->platform_pm_engaged = true;
up_write(&vdev->memory_lock);


5. The idea here is to check the 'access_count' in
vfio_pci_core_feature_pm(). If 'access_count' is greater than 1,
that means some other ioctl or config space access is in progress,
and we return early. Otherwise, we can set 'platform_pm_engaged' and
release the lock.

6. In case of a race, if vfio_pci_core_feature_pm() gets the
lock and finds 'access_count' equal to 1, then it sets 'platform_pm_engaged'.
Any later config space access or ioctl will then see
'platform_pm_engaged' as true and return early.

If the config space access or ioctl happens first, then
'platform_pm_engaged' will be false and the request will be
successful. But the 'access_count' will stay incremented until
the request completes. Now, vfio_pci_core_feature_pm() will see
'access_count' as 2 and will return -EBUSY.

7. For ioctl access, I need to add two callback functions (one
for start and one for end) to struct vfio_device_ops and call
them at the start and end of the ioctl from vfio_device_fops_unl_ioctl().

Another option was to add one more lock like 'memory_lock' and hold
it throughout the config and ioctl access, but maintaining
two locks won't be easy since 'memory_lock' is already being
used inside the config and ioctl paths.

Thanks,
Abhishek