Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle

From: Kai Heng Feng
Date: Fri May 10 2019 - 11:15:01 EST




> On May 10, 2019, at 4:23 PM, Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
>
> On Fri, May 10, 2019 at 8:08 AM Kai-Heng Feng
> <kai.heng.feng@xxxxxxxxxxxxx> wrote:
>>
>> at 06:19, <Mario.Limonciello@xxxxxxxx> <Mario.Limonciello@xxxxxxxx> wrote:
>>
>>>> -----Original Message-----
>>>> From: Keith Busch <kbusch@xxxxxxxxxx>
>>>> Sent: Thursday, May 9, 2019 4:54 PM
>>>> To: Limonciello, Mario
>>>> Cc: kai.heng.feng@xxxxxxxxxxxxx; hch@xxxxxx; axboe@xxxxxx;
>>>> sagi@xxxxxxxxxxx; rafael@xxxxxxxxxx; linux-pm@xxxxxxxxxxxxxxx;
>>>> rafael.j.wysocki@xxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; linux-
>>>> nvme@xxxxxxxxxxxxxxxxxxx; keith.busch@xxxxxxxxx
>>>> Subject: Re: [PATCH] nvme-pci: Use non-operational power state instead
>>>> of D3 on
>>>> Suspend-to-Idle
>>>>
>>>>
>>>> [EXTERNAL EMAIL]
>>>>
>>>> On Thu, May 09, 2019 at 09:37:58PM +0000, Mario.Limonciello@xxxxxxxx
>>>> wrote:
>>>>>> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss)
>>>>>> +{
>>>>>> + int ret;
>>>>>> +
>>>>>> + mutex_lock(&ctrl->scan_lock);
>>>>>> + nvme_start_freeze(ctrl);
>>>>>> + nvme_wait_freeze(ctrl);
>>>>>> + ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0,
>>>>>> + NULL);
>>>>>> + nvme_unfreeze(ctrl);
>>>>>> + mutex_unlock(&ctrl->scan_lock);
>>>>>> +
>>>>>> + return ret;
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(nvme_set_power);
>>>>>
>>>>> I believe without memory barriers at the end disks with HMB this will
>>>>> still kernel panic (Such as Toshiba BG3).
>>>>
>>>> Well, the mutex has an implied memory barrier, but your HMB explanation
>>>> doesn't make much sense to me anyway. The "mb()" in this thread's original
>>>> patch is a CPU memory barrier, and the CPU had better not be accessing
>>>> HMB memory. Is there something else going on here?
>>>
>>> Kai Heng will need to speak up a bit in his time zone as he has this disk
>>> on hand,
>>> but what I recall from our discussion was that DMA operation MemRd after
>>> resume was the source of the hang.
>>
>> Yes, thatâ what I was told by the NVMe vendor, so all I know is to impose a
>> memory barrier.
>> If mb() shouldnât be used here, whatâs the correct variant to use in this
>> context?
>>
>>>
>>>>> This still allows D3 which we found at least failed to go into deepest
>>>>> state and
>>>> blocked
>>>>> platform s0ix for the following SSDs (maybe others):
>>>>> Hynix PC601
>>>>> LiteOn CL1
>>>>
>>>> We usually write features to spec first, then quirk non-compliant
>>>> devices after.
>>>
>>> NVME spec doesn't talk about a relationship between SetFeatures w/
>>> NVME_FEAT_POWER_MGMGT and D3 support, nor order of events.
>>>
>>> This is why we opened a dialog with storage vendors, including
>>> contrasting the behavior
>>> of Microsoft Windows inbox NVME driver and Intel's Windows RST driver.
>>>
>>> Those two I mention that come to mind immediately because they were most
>>> recently
>>> tested to fail. Our discussion with storage vendors overwhelmingly
>>> requested
>>> that we don't use D3 under S2I because their current firmware
>>> architecture won't
>>> support it.
>>>
>>> For example one vendor told us with current implementation that receiving
>>> D3hot
>>> after NVME shutdown will prevent being able to enter L1.2. D3hot entry
>>> was supported
>>> by an IRQ handler that isn't serviced in NVME shutdown state.
>>>
>>> Another vendor told us that with current implementation it's impossible
>>> to transition
>>> to PS4 (at least via APST) while L1.2 D3hot is active.
>>
>> I tested the patch from Keith and it has two issues just as simply skipping
>> nvme_dev_disable():
>> 1) It consumes more power in S2I
>> 2) System freeze after resume
>
> Well, the Keith's patch doesn't prevent pci_pm_suspend_noirq() from
> asking for D3 and both of the symptoms above may be consequences of
> that in principle.

Sorry, I should mention that I use a slightly modified drivers/nvme/host/pci.c:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3e4fb891a95a..ece428ce6876 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -18,6 +18,7 @@
#include <linux/mutex.h>
#include <linux/once.h>
#include <linux/pci.h>
+#include <linux/suspend.h>
#include <linux/t10-pi.h>
#include <linux/types.h>
#include <linux/io-64-nonatomic-lo-hi.h>
@@ -2833,6 +2834,11 @@ static int nvme_suspend(struct device *dev)
struct pci_dev *pdev = to_pci_dev(dev);
struct nvme_dev *ndev = pci_get_drvdata(pdev);

+ if (!pm_suspend_via_firmware()) {
+ nvme_set_power(&ndev->ctrl, ndev->ctrl.npss);
+ pci_save_state(pdev);
+ }
+
nvme_dev_disable(ndev, true);
return 0;
}
@@ -2842,6 +2848,10 @@ static int nvme_resume(struct device *dev)
struct pci_dev *pdev = to_pci_dev(dev);
struct nvme_dev *ndev = pci_get_drvdata(pdev);

+ if (!pm_resume_via_firmware()) {
+ return nvme_set_power(&ndev->ctrl, 0);
+ }
+
nvme_reset_ctrl(&ndev->ctrl);
return 0;
}

Does pci_save_state() here enough to prevent the device enter to D3?

Kai-Heng