Re: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()

From: Nicolin Chen
Date: Fri Feb 10 2023 - 19:10:50 EST


On Fri, Feb 10, 2023 at 02:11:23AM +0000, Tian, Kevin wrote:

> > > > + if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
> > > > + if (list_empty(&hwpt->devices)) {
> > > > + iopt_table_remove_domain(&hwpt->ioas->iopt,
> > > > + hwpt->domain);
> > > > + list_del(&hwpt->hwpt_item);
> > > > + }
> > >
> > > I'm not sure how this can be fully shared between detach and replace.
> > > Here some work e.g. above needs to be done before calling
> > > iommu_group_replace_domain() while others can be done afterwards.
> >
> > This iopt_table_remove_domain/list_del is supposed to be done in
> > the hwpt's destroy() actually. We couldn't move it because it'd
> > need the new domain_alloc_user op and its implementation in ARM
> > driver. Overall, I think it should be safe to put it behind the
> > iommu_group_replace_domain().
> >
>
> My confusion is that we have different flows between detach/attach
> and replace.
>
> today with separate detach+attach we have following flow:
>
> Remove device from current hwpt;
> if (last_device in hwpt) {
> Remove hwpt domain from current iopt;
> if (last_device in group)
> detach group from hwpt domain;
> }
>
> if (first device in group) {
> attach group to new hwpt domain;
> if (first_device in hwpt)
> Add hwpt domain to new iopt;
> Add device to new hwpt;
>
> but replace flow is different on the detach part:
>
> if (first device in group) {
> replace group's domain from current hwpt to new hwpt;
> if (first_device in hwpt)
> Add hwpt domain to new iopt;
> }
>
> Remove device from old hwpt;
> if (last_device in old hwpt)
> Remove hwpt domain from old iopt;
>
> Add device to new hwpt;

Oh... thinking it carefully, I see the flow does look a bit off.
Perhaps it's better to have a similar flow for replace.

However, I think something would be still different due to its
tricky nature, especially for a multi-device iommu_group.

An iommu_group_detach happens only when a device is the last one
in its group to go through the routine via a DETACH ioctl, while
an iommu_group_replace_domain() happens only when the device is
the first one in its group to go through the routine via another
ATTACH ioctl. However, when the first device does a replace, the
cleanup routine of the old hwpt is a NOP, since there are still
other devices (same group) in the old hwpt. And two implications
here:
1) Any other device in the same group has to forcibly switch to
the new domain, when the first device does a replace.
2) The actual hwpt cleanup can only happen at the last device's
replace call.

This also means that kernel has to rely on the integrity of the
user space that it must replace all active devices in the group:

For a three-device iommu_group,
[scenario 1]
a. ATTACH dev1 to hwpt1;
b. ATTACH dev2 to hwpt1;
c. ATTACH dev3 to hwpt1;
d. ATTACH (REPLACE) dev1 to hwpt2;
(no hwpt1 cleanup; replace dev2&3 too; do hwpt2 init)
e. ATTACH (REPLACE) dev2 to hwpt2; // user space must do
(no hwpt1 cleanup; no dev2 replace; no hwpt2 init)
f. ATTACH (REPLACE) dev3 to hwpt2; // user space must do
(do hwpt1 cleanup; no dev3 replace; no hwpt2 init)

[scenario 2]
a. ATTACH dev1 to hwpt1;
b. ATTACH dev2 to hwpt1;
c. ATTACH dev3 to hwpt1;
d. DETACH dev3 from hwpt1;
(detach dev3; no hwpt1 cleanup)
f. ATTACH (REPLACE) dev1 to hwpt2;
(no hwpt1 cleanup; replace dev2&3 too; do hwpt2 init)
g. ATTACH (REPLACE) dev2 to hwpt2; // user space must do
(do hwpt1 cleanup; no dev2 replace; no hwpt2 init)
h. (optional) ATTACH dev3 to hwpt2; // clean ATTACH, not a REPLACE
(no hwpt1 cleanup; no dev3 replace; no hwpt2 init)

[scenario 3]
a. ATTACH dev1 to hwpt1;
b. ATTACH dev2 to hwpt1;
c. ATTACH dev3 to hwpt1;
d. DETACH dev3 from hwpt1;
(detach dev3; no hwpt1 cleanup)
e. DETACH dev2 from hwpt1;
(detach dev2; no hwpt1 cleanup)
f. ATTACH (REPLACE) dev1 to hwpt2;
(do hwpt1 cleanup; replace dev2&3 too; do hwpt2 init)
g. (optional) ATTACH dev2 to hwpt2; // clean ATTACH, not a REPLACE
(no hwpt1 cleanup; no dev2 replace; no hwpt2 init)
h. (optional) ATTACH dev3 to hwpt2; // clean ATTACH, not a REPLACE
(no hwpt1 cleanup; no dev3 replace; no hwpt2 init)

Following the narratives above,

[current detach+attach flow]
// DETACH dev1 from hwpt1;
Log dev1 out of the hwpt1's device list;
NOP; // hwpt1 has its group;
iopt_remove_reserved_iova(hwpt1->iopt, dev1);
idev1->hwpt = NULL;
refcount_dec();
// DETACH dev2 from hwpt1;
Log dev2 out of the hwpt1's device list;
if (hwpt1 does not have its group) { // last device to detach
if (hwpt1's device list is empty)
iopt_table_remove_domain/list_del(hwpt1);
iommu_detach_group();
}
iopt_remove_reserved_iova(hwpt1->iopt, dev2);
idev2->hwpt = NULL;
refcount_dec();
...
// ATTACH dev1 to hwpt2;
iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev1);
if (hwpt2 does not have its group) { // first device to attach
iommu_attach_group();
if (hwpt2's device list is empty)
iopt_table_add_domain/list_add(hwpt2);
}
idev1->hwpt = hwpt2;
refcount_inc();
Log dev1 in the hwpt2's device list;
// ATTACH dev2 to hwpt2;
iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev2);
NOP; // hwpt2 has its group;
idev2->hwpt = hwpt2;
refcount_inc();
Log dev2 in to the hwpt2's device list;


[correct (?) replace flow - scenario 1 above]

// 1.d Switch (REPLACE) dev1 from hwpt1 to hwpt2;
partial detach (dev1) {
Log dev1 out of the hwpt1's device list;
NOP // hwpt1 has its group, and hwpt1's device list isn't empty
iopt_remove_reserved_iova(hwpt1->iopt, dev1);
refcount_dec();
}
iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev1);
if (hwpt2 does not have its group) { // first device to replace
iommu_group_replace_domain();
if (hwpt2's device list is empty)
iopt_table_add_domain/list_add(hwpt2);
}
idev1->hwpt = hwpt2;
refcount_int();
Log dev1 in the hwpt2's device list;

// 1.e Switch (REPLACE) dev2 from hwpt1 to hwpt2;
partial detach (dev2) {
Log dev2 out of the hwpt1's device list;
NOP // hwpt1 has its group, and hwpt1's device list isn't empty
iopt_remove_reserved_iova(hwpt1->iopt, dev2);
refcount_dec();
}
iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev2);
NOP; // hwpt2 has its group, and hwpt2's device list isn't empty
idev2->hwpt = hwpt2;
refcount_int();
Log dev2 in the hwpt2's device list;

// 1.f Switch (REPLACE) dev3 from hwpt1 to hwpt2;
partial detach (dev3) {
Log dev3 out of the hwpt1's device list;
if (hwpt1 does not have its group) { // last device to detach
if (hwpt1's device list is empty)
iopt_table_remove_domain/list_del(hwpt1);
}
iopt_remove_reserved_iova(hwpt1->iopt, dev3);
refcount_dec();
}
iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev3);
NOP; // hwpt2 has its group, and hwpt2's device list isn't empty
idev3->hwpt = hwpt2;
refcount_int();
Log dev3 in the hwpt2's device list;

And, this would also complicate the error-out routines too...

> I'm yet to figure out whether we have sufficient lock protection to
> prevent other paths from using old iopt/hwpt to find the device
> which is already attached to a different domain.

With the correct (?) flow, I think it'd be safer for one-device
group. But it's gets tricker for the multi-device case above:
the dev2 and dev3 are already in the new domain, but all their
iommufd objects (idev->hwpt and iopt) are lagging. Do we need a
lock across the three IOCTLs?

One way to avoid it is to force-update idev2 and idev3 too when
idev1 does a replace -- by iterating all same-group devices out
of the old hwpt. This, however, feels a violation against being
device-centric...

i.e. scenario 1:
a. ATTACH dev1 to hwpt1;
b. ATTACH dev2 to hwpt1;
c. ATTACH dev3 to hwpt1;
d. ATTACH (REPLACE) dev1 to hwpt2;
(do hwpt1 cleanup; replace dev2&3 and update idev2&3 too; do hwpt2 init)
e. ATTACH (REPLACE) dev2 to hwpt2; // user space must do, or be aware of 1.d
(kernel does dummy)
f. ATTACH (REPLACE) dev3 to hwpt2; // user space must do, or be aware of 1.d
(kernel does dummy)

Thanks
Nic