Re: [PATCH]nvme-pci: Fixes EEH failure on ppc

From: wenxiong
Date: Wed Feb 07 2018 - 15:18:42 EST


On 2018-02-06 19:24, Ming Lei wrote:
On Tue, Feb 06, 2018 at 02:01:05PM -0600, wenxiong wrote:
On 2018-02-06 10:33, Keith Busch wrote:
> On Mon, Feb 05, 2018 at 03:49:40PM -0600, wenxiong@xxxxxxxxxxxxxxxxxxxx
> wrote:
> > @@ -1189,6 +1183,12 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
> >  	struct nvme_command cmd;
> >  	u32 csts = readl(dev->bar + NVME_REG_CSTS);
> >
> > +	/* If PCI error recovery process is happening, we cannot reset or
> > +	 * the recovery mechanism will surely fail.
> > +	 */
> > +	if (pci_channel_offline(to_pci_dev(dev->dev)))
> > +		return BLK_EH_HANDLED;
> > +
>
> This patch will tell the block layer to complete the request and consider
> it a success, but it doesn't look like the command actually completed at
> all. You're going to get data corruption this way, right? Is returning
> BLK_EH_HANDLED immediately really the right thing to do here?

Hi Ming,

Can you help check whether it is OK to return BLK_EH_HANDLED in this
case?

Hi Wenxiong,

Looks like Keith is correct: if BLK_EH_HANDLED is returned, this timed-out
request will be completed by the block layer and the NVMe driver even
though the IO hasn't actually completed, so the result is either data loss
(write) or a read failure.

Maybe BLK_EH_RESET_TIMER is fine in this situation.

Thanks,
Ming

Hi Ming,

Thanks! I have tried with BLK_EH_RESET_TIMER and EEH recovery works fine. I am going to resubmit the patch.
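
The change discussed above can be sketched roughly as follows. This is a
userspace model only, not the resubmitted patch: struct fake_nvme_dev,
its channel_offline flag (standing in for pci_channel_offline()), and
model_nvme_timeout() are hypothetical names made up for illustration, and
the enum is a local copy of the block layer's return codes as named in
this thread.

```c
#include <stdbool.h>

/* Local stand-in for the block layer's timeout return codes. */
enum blk_eh_timer_return {
	BLK_EH_NOT_HANDLED,
	BLK_EH_HANDLED,
	BLK_EH_RESET_TIMER,
};

/* Hypothetical model of the only device state this sketch cares about. */
struct fake_nvme_dev {
	bool channel_offline;	/* stands in for pci_channel_offline() */
};

/*
 * Model of the early check in the timeout handler. While PCI error
 * recovery owns the device, do not reset the controller and do not
 * complete the request; re-arm the timer so the request stays
 * outstanding until recovery has finished.
 */
enum blk_eh_timer_return model_nvme_timeout(const struct fake_nvme_dev *dev)
{
	if (dev->channel_offline)
		return BLK_EH_RESET_TIMER;

	/* Placeholder: the rest of the real handler is not modelled here. */
	return BLK_EH_NOT_HANDLED;
}
```

The point of returning BLK_EH_RESET_TIMER instead of BLK_EH_HANDLED is
that the request is never reported as complete while the channel is
offline, which avoids the data loss that Keith and Ming describe.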

Thanks,
Wendy