Re: [PATCH V3 1/1] nvme: Add quirk for LiteON CL1 devices running FW 220TQ,22001

From: Christoph Hellwig
Date: Tue Nov 03 2020 - 04:23:24 EST


On Tue, Nov 03, 2020 at 02:21:16AM +0000, Gloria Tsai wrote:
> When host issue shutdown + D3hot in suspend, NVMe drive might have
> chance choosing wrong pointer which has already been used by GC then
> cause over program.
> Do GC before shutdown -> delete IO Q -> shutdown from host -> breakup GC -> D3hot -> enter PS4 -> have a chance swap block -> use wrong pointer on device SRAM -> over program

Aka there is data corruption?

> The issue only happens in simple suspend (shutdown+D3hot) with specific FW on Kahoku board.

Kahoku is a specific LiteOn controller? Or it is the host system?

Maybe main issue with this patch is that it mixes up two axis:

- use power states for suspend despite HMB on specific host systems
identified by the DMI ids. This kinda makes sense to me, as
the power state based suspends has lots of advantages, so having
a whitelist when to use it seem ok, despite the clutter that this
causes.
- then tie this to specific NVMe devices that don't work without this
quirk, which leaves open the issue what we do when we encounter such
a device in a different host system. If shutdown + D3hot causes
problems there is seems like for the case where the above quirk doesn't
apply we should just skip the shutdown and let the D3hot do a surprise
power removal? That's mean recovery when coming back from the suspend,
but would cause less corruption?

In other words I think this needs to be two patches:

1) quirk based on the DMI table and allow power state based
suspend on given host systems even when a HMB is enabled
2) quirk based on the nvme device (and if possible use the PCI IDs)
to disable shutdown before suspend (possibly with a warning printk
when this happens)