Re: [PATCH v6 4/5] PCI: only return true when dev io state is really changed

From: Ethan Zhao
Date: Sat Oct 03 2020 - 03:06:10 EST


Bjorn,

On Sat, Oct 3, 2020 at 1:29 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> [+cc Sinan]
>
> On Wed, Sep 30, 2020 at 03:05:36AM -0400, Ethan Zhao wrote:
> > When uncorrectable error happens, AER driver and DPC driver interrupt
> > handlers likely call
> >
> > pcie_do_recovery()
> > ->pci_walk_bus()
> > ->report_frozen_detected()
> >
> > with pci_channel_io_frozen the same time.
> > If pci_dev_set_io_state() return true even if the original state is
> > pci_channel_io_frozen, that will cause AER or DPC handler re-enter
> > the error detecting and recovery procedure one after another.
> > The result is the recovery flow mixed between AER and DPC.
> > So simplify the pci_dev_set_io_state() function to only return true
> > when dev->error_state is changed.
> >
> > Signed-off-by: Ethan Zhao <haifeng.zhao@xxxxxxxxx>
> > Tested-by: Wen Jin <wen.jin@xxxxxxxxx>
> > Tested-by: Shanshan Zhang <ShanshanX.Zhang@xxxxxxxxx>
> > Reviewed-by: Alexandru Gagniuc <mr.nuke.me@xxxxxxxxx>
> > Reviewed-by: Andy Shevchenko <andy.shevchenko@xxxxxxxxx>
> > ---
> > Changnes:
> > v2: revise description and code according to suggestion from Andy.
> > v3: change code to simpler.
> > v4: no change.
> > v5: no change.
> > v6: no change.
> >
> > drivers/pci/pci.h | 37 +++++--------------------------------
> > 1 file changed, 5 insertions(+), 32 deletions(-)
> >
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > index 455b32187abd..f2beeaeda321 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -359,39 +359,12 @@ struct pci_sriov {
> > static inline bool pci_dev_set_io_state(struct pci_dev *dev,
> > pci_channel_state_t new)
> > {
> > - bool changed = false;
> > -
> > device_lock_assert(&dev->dev);
> > - switch (new) {
> > - case pci_channel_io_perm_failure:
> > - switch (dev->error_state) {
> > - case pci_channel_io_frozen:
> > - case pci_channel_io_normal:
> > - case pci_channel_io_perm_failure:
> > - changed = true;
> > - break;
> > - }
> > - break;
> > - case pci_channel_io_frozen:
> > - switch (dev->error_state) {
> > - case pci_channel_io_frozen:
> > - case pci_channel_io_normal:
> > - changed = true;
> > - break;
> > - }
> > - break;
> > - case pci_channel_io_normal:
> > - switch (dev->error_state) {
> > - case pci_channel_io_frozen:
> > - case pci_channel_io_normal:
> > - changed = true;
> > - break;
> > - }
> > - break;
> > - }
> > - if (changed)
> > - dev->error_state = new;
> > - return changed;
> > + if (dev->error_state == new)
> > + return false;
> > +
> > + dev->error_state = new;
> > + return true;
> > }
>
> IIUC this changes the behavior of the function, but it's difficult to
> analyze because it does a lot of simplification at the same time.
>
> Please consider the following, which is intended to simplify the
> function while preserving the behavior (but please verify; it's been a
> long time since I looked at this). Then maybe see how your patch
> could be done on top of this?
>
> Alternatively, come up with your own simplification patch + the
> functionality change.
>
>
> commit 983d9b1f8177 ("PCI/ERR: Simplify pci_dev_set_io_state()")
> Author: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
> Date: Tue May 19 12:28:57 2020 -0500
>
> PCI/ERR: Simplify pci_dev_set_io_state()
>
> Truth table:
>
> requested new state
> current ------------------------------------------
> state normal frozen perm_failure
> ------------ + ------------- ------------- ------------
> normal | normal frozen perm_failure
> frozen | normal frozen perm_failure
> perm_failure | perm_failure* perm_failure* perm_failure
>
> * "not changed", returns false
>
> No functional change intended.
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 6d3f75867106..81408552f7c9 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -358,39 +358,21 @@ struct pci_sriov {
> static inline bool pci_dev_set_io_state(struct pci_dev *dev,
> pci_channel_state_t new)
> {
> - bool changed = false;
> -
> device_lock_assert(&dev->dev);
> - switch (new) {
> - case pci_channel_io_perm_failure:
> - switch (dev->error_state) {
> - case pci_channel_io_frozen:
> - case pci_channel_io_normal:
> - case pci_channel_io_perm_failure:
> - changed = true;
> - break;
> - }
> - break;
> - case pci_channel_io_frozen:
> - switch (dev->error_state) {
> - case pci_channel_io_frozen:
> - case pci_channel_io_normal:
> - changed = true;
> - break;
> - }
> - break;
> - case pci_channel_io_normal:
> - switch (dev->error_state) {
> - case pci_channel_io_frozen:
> - case pci_channel_io_normal:
> - changed = true;
> - break;
> - }
> - break;
> +
> + /* Can always put a device in perm_failure state */
> + if (new == pci_channel_io_perm_failure) {
> + dev->error_state == pci_channel_io_perm_failure;
> + return true;
> }
> - if (changed)
> - dev->error_state = new;
> - return changed;
> +
> + /* If already in perm_failure, can't set to normal or frozen */
> + if (dev->error_state == pci_channel_io_perm_failure)
> + return false;
This line goes to the first.
> +
> + /* Can always change normal to frozen or vice versa */
> + dev->error_state = new;
> + return true;
> }
Per code grepping, there are two kinds of cases
pci_dev_set_io_state() are called.
1. doesn't care about the return value, just set it to
pci_chnnel_io_perm_failure.
and later handling depends on dev->error_state, whatever the
value changed or not.
pciehp_unconfigure_device()
-> pci_walk_bus()
-> pci_dev_set_disconnected()
->pci_dev_set_io_state() with pci_chnnel_io_perm_failure

2. cares about the return value or the dev->error_state is changed or not. and
the later handling also depends on if the value of dev->error_state
is changed or not.

pcie_do_recovery()
->pci_walk_bus()
->report_frozen_detected()
->report_error_detected()
->pci_dev_set_io_state() with pci_channel_io_frozen.

This case, if the original value was pci_channel_io_frozen, no
need to set it again
and say it is *changed*. ----- return true only it is really changed.

pcie_do_recovery()
->pci_walk_bus()
->report_resume()
->pcI_dev_set_ios_state() with pci_channel_io_normal.

The same.

pcie_do_recovery()
->pci_walk_bus()
->report_normal_detected() with pci_channel_io_normal

The same.

I will combat the 'Truth Table' to keep the original functions
and let it return true
only when it really changes to avoid re-enter the later handling process.

Will send v7 for your review.

Thanks,
Ethan
>
> static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)