RE: [PATCH v2 3/3] scsi: ufs: ufs-exynos: implement exynos isr

From: Kiwoong Kim
Date: Tue Sep 14 2021 - 01:13:00 EST


> On 9/13/21 12:55 AM, Kiwoong Kim wrote:
> > This patch is to raise recovery in some abnormal conditions using an
> > vendor specific interrupt for some cases, such as a situation that
> > some contexts of a pending request in the host isn't the same with
> > those of its corresponding UPIUs if they should have been the same
> > exactly.
> >
> > The representative case is shown like below.
> > In the case, a broken UTRD entry, for internal coherent problem or
> > whatever, that had smaller value of PRDT length than expected was
> > transferred to the host.
> > So, the host raised an interrupt of transfer complete even if device
> > didn't finish its data transfer because the host sees a fetched
> > version of UTRD to determine if data tranfer is over or not. Then the
> > application level seemed to recogize this as a sort of corruption and
> > this symptom led to boot failure.
>
> How can a UTRD entry be broken? Does that perhaps indicate memory
> corruption at the host side? Working around host-side memory corruption in
> a driver seems wrong to me. I think the root cause of the memory
> corruption should be fixed.

For SoC internal problems, yes, of course, they should be fixed.
But I don't think the causes always come from inside the system.
They could be outside the system or a device, such as sending DATA IN
with a tag that a host has ever submitted a command with because of
some bugs of the device. You might think putting this sort of code doesn't
make sense but there could be various events that can't be understood in the
point of view of the spec. And chips that is already fab-outed can't be fixed.
That's why I put the details into Exynos. I'm not talking about the spec.
I think Exynos isn't required to contain only things related with the spec
and it can have realistic part.

>
> > +static irqreturn_t exynos_ufs_isr(struct ufs_hba *hba) {
> > + struct exynos_ufs *ufs = ufshcd_get_variant(hba);
> > + u32 status;
> > + unsigned long flags;
> > +
> > + if (!hba->priv) return IRQ_HANDLED;
>
> Please verify patches with checkpatch before posting these on the linux-
> scsi mailing list. The above if-statement does not follow the Linux kernel
> coding style.
>
> > + if (status & RX_UPIU_HIT_ERROR) {
> > + pr_err("%s: status: 0x%08x\n", __func__, status);
> > + hba->force_reset = true;
> > + hba->force_requeue = true;
> > + scsi_schedule_eh(hba->host);
> > + spin_unlock_irqrestore(hba->host->host_lock, flags);
> > + return IRQ_HANDLED;
> > + }
> > + return IRQ_NONE;
> > +}
>
> So the above code unlocks the host_lock depending on whether or not status
> & RX_UPIU_HIT_ERROR is true? Yikes ...
>
> Additionally, in the above code I found the following pattern:
>
> unsigned long flags;
> [ ... ]
> spin_unlock_irqrestore(hba->host->host_lock, flags);
>
> Such code is ALWAYS wrong. The value of the 'flags' argument passed to
> spin_unlock_irqrestore() must come from spin_lock_irqsave().
>
> Bart.

I missed for two things. Thanks, I'll look more carefully.