RE: [PATCH v3 2/6] PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic

From: Dexuan Cui
Date: Thu Jun 15 2023 - 00:27:56 EST

Next message: Ian Rogers: "Re: [PATCH V2 4/8] perf metrics: Sort the Default metricgroup"
Previous message: Gautham R. Shenoy: "Re: [RFC PATCH 0/4] Limit the scan depth to find the busiest sched group during newidle balance"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

> From: Lorenzo Pieralisi <lpieralisi@xxxxxxxxxx>
> Sent: Thursday, May 25, 2023 3:16 AM
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -643,6 +643,11 @@ static void hv_arch_irq_unmask(struct irq_data
> > *data)
> > pbus = pdev->bus;
> > hbus = container_of(pbus->sysdata, struct hv_pcibus_device, sysdata);
> > int_desc = data->chip_data;
> > + if (!int_desc) {
> > + dev_warn(&hbus->hdev->device, "%s() can not unmask irq %u\n",
> > + __func__, data->irq);
> > + return;
> > + }
>
> That's a check that should be there regardless ?
Yes. Normally data->chip_data is set at the end of hv_compose_msi_msg(),
and it should not be NULL. However, in rare circumstances, we might see a
failure in hv_compose_msi_msg(), e.g. the hypervisor/host might return an
error in comp.comp_pkt.completion_status (at least this is possible in theory).

In case we see a failure in hv_compose_msi_msg(), data->chip_data stays
with its default value of NULL; because the return type of
hv_compose_msi_msg() is "void", we can not return an error to the upper
layer; later, when the upper layer calls request_irq() -> ... -> hv_irq_unmask(),
hv_arch_irq_unmask() crashes because data->chip_data is NULL -- with this
check, we're able to error out gracefully, and the user can better understand
what goes wrong.

> > spin_lock_irqsave(&hbus->retarget_msi_interrupt_lock, flags);
> >
> > @@ -1911,12 +1916,6 @@ static void hv_compose_msi_msg(struct
> irq_data *data, struct msi_msg *msg)
> > hv_pci_onchannelcallback(hbus);
> > spin_unlock_irqrestore(&channel->sched_lock, flags);
> >
> > - if (hpdev->state == hv_pcichild_ejecting) {
> > - dev_err_once(&hbus->hdev->device,
> > - "the device is being ejected\n");
> > - goto enable_tasklet;
> > - }
> > -
> > udelay(100);
> > }
>
> I don't understand why this code is in hv_compose_msi_msg() in the first
> place (and why only in that function ?) to me this looks like you are
> adding plasters in the code that can turn out to be problematic while
> ejecting a device, this does not seem robust at all - that's my opinion.

The code was incorrectly added by
de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in hv_compose_msi_msg()")

de0aa7b2f97d says

"
2. If the host is ejecting the VF device before we reach
hv_compose_msi_msg(), in a UP VM, we can hang in hv_compose_msi_msg()
forever, because at this time the host doesn't respond to the
CREATE_INTERRUPT request.
"

de0aa7b2f97d implies that the host doesn't respond to the guest's
CREATE_INTERRUPT request once the guest receives the PCI_EJECT message -- this
is incorrect: after the guest receives the PCI_EJECT message, actually the host
still responds to the guest's request, as long as the guest sends the request within
1 minute AND the guest doesn't send a PCI_EJECTION_COMPLETE message to
the host in hv_eject_device_work(). The real issue is that currently the guest
can send PCI_EJECTION_COMPLETE to the host before the guest finishes the
device-probing/removing handling -- once the guest sends PCI_EJECTION_COMPLETE,
the host unassigns the PCI device from the guest and ignores any request
from the guest.

So here the check "hpdev->state == hv_pcichild_ejecting" is incorrect. We
should remove the check since it can cause a panic (see the commit messsage
for the detailed explanation)

The "premature PCI_EJECTION_COMPLETE" issue is resolved by:
[PATCH v3 5/6] PCI: hv: Add a per-bus mutex state_lock

> Feel free to merge this code, I can't ACK it, sorry.
>
> Lorenzo
Thanks for sharing the thougths!

Next message: Ian Rogers: "Re: [PATCH V2 4/8] perf metrics: Sort the Default metricgroup"
Previous message: Gautham R. Shenoy: "Re: [RFC PATCH 0/4] Limit the scan depth to find the busiest sched group during newidle balance"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]