Re: [PATCH 1/2] KVM: async kvm_destroy_vm for vfio devices

From: Sean Christopherson
Date: Wed Jan 11 2023 - 14:55:07 EST


On Mon, Jan 09, 2023, Matthew Rosato wrote:
> Currently it is possible that the final put of a KVM reference comes from
> vfio during its device close operation. This occurs while the vfio group
> lock is held; however, if the vfio device is still in the kvm device list,
> then the following call chain could result in a deadlock:
>
> kvm_put_kvm
> -> kvm_destroy_vm
> -> kvm_destroy_devices
> -> kvm_vfio_destroy
> -> kvm_vfio_file_set_kvm
> -> vfio_file_set_kvm
> -> group->group_lock/group_rwsem
>
> Avoid this scenario by adding kvm_put_kvm_async which will perform the
> kvm_destroy_vm asynchronously if the refcount reaches 0.

Something feels off. If KVM's refcount is 0, then accessing device->group->kvm
in vfio_device_open() can't happen unless there's a refcounting bug somewhere.

E.g. if this snippet holds group_lock

mutex_lock(&device->group->group_lock);
device->kvm = device->group->kvm;

if (device->ops->open_device) {
ret = device->ops->open_device(device);
if (ret)
goto err_undo_count;
}
vfio_device_container_register(device);
mutex_unlock(&device->group->group_lock);

and kvm_vfio_destroy() has already been invoked and is waiting on group_lock in
vfio_file_set_kvm(), then device->ops->open_device() is being called with a
non-NULL device->kvm that has kvm->users_count==0. And intel_vgpu_open_device()
at least uses kvm_get_kvm(), i.e. assumes kvm->users_count > 0.

If there's no refcounting bug then kvm_vfio_destroy() doesn't need to call
kvm_vfio_file_set_kvm() since the only way there isn't a refcounting bug is if
group->kvm is unreachable. This seems unlikely.

Assuming there is indeed a refcounting issue, one solution would be to harden all
->open_device() implementations to use kvm_get_kvm_safe(). I'd prefer not to have
to do that since it will complicate those implementations and also requires KVM
to support an async put.

Rather than force devices to get KVM references, why not handle that in common
VFIO code and drop KVM refcountin from devices? Worst case scenario KVM is pinned
by a device that doesn't need KVM but is in a group associated with KVM. If that's
a concern, it seems easy enough to add a flag to vfio_device_ops to enumerate
whether or not the device depends on KVM.

---
drivers/vfio/vfio_main.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 6e8804fe0095..fb43212d77a0 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -772,6 +772,13 @@ static struct file *vfio_device_open(struct vfio_device *device)
* reference and release it during close_device.
*/
mutex_lock(&device->group->group_lock);
+
+ if (device->group->kvm &&
+ !kvm_get_kvm_safe(device->group->kvm->kvm)) {
+ ret = -ENODEV;
+ goto err_undo_count;
+ }
+
device->kvm = device->group->kvm;

if (device->ops->open_device) {
@@ -823,8 +830,10 @@ static struct file *vfio_device_open(struct vfio_device *device)
err_undo_count:
mutex_unlock(&device->group->group_lock);
device->open_count--;
- if (device->open_count == 0 && device->kvm)
+ if (device->open_count == 0 && device->kvm) {
+ kvm_put_kvm(device->kvm);
device->kvm = NULL;
+ }
mutex_unlock(&device->dev_set->lock);
module_put(device->dev->driver->owner);
err_unassign_container:
@@ -1039,8 +1048,10 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
}
mutex_unlock(&device->group->group_lock);
device->open_count--;
- if (device->open_count == 0)
+ if (device->open_count == 0 && device->kvm) {
+ kvm_put_kvm(device->kvm);
device->kvm = NULL;
+ }
mutex_unlock(&device->dev_set->lock);

module_put(device->dev->driver->owner);

base-commit: d52444c7a90fc551b4c3b0bda7d3f0b2ca9fc84d
--