Re: Regression in Kernel 6.0: System partially freezes with "nvme controller is down"

From: Bjorn Helgaas
Date: Thu Jan 12 2023 - 12:11:17 EST


On Thu, Jan 12, 2023 at 03:48:46PM +0100, Linux kernel regression tracking (Thorsten Leemhuis) wrote:
> ...
> On 11.01.23 23:11, Julian Groß wrote:
> > Dear Maintainer,
> >
> > when running Linux Kernel version 6.0.12, 6.0.10, 6.0-rc7, or 6.1.4, my
> > system seemingly randomly freezes due to the file system being set to
> > read-only due to an issue with my NVMe controller.
> > The issue does *not* appear on Linux Kernel version 5.19.11 or lower.
> >
> > Through network logging I am able to catch the issue:
> > ```
> > Jan  8 14:50:16 x299-desktop kernel: [ 1461.259288] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
> > Jan  8 14:50:16 x299-desktop kernel: [ 1461.259293] nvme nvme0: Does your device have a faulty power saving mode enabled?
> > Jan  8 14:50:16 x299-desktop kernel: [ 1461.259293] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
> > Jan  8 14:50:16 x299-desktop kernel: [ 1461.331360] nvme 0000:01:00.0: enabling device (0000 -> 0002)
> > ...
> > ```
> >
> > I have tried the suggestion in the log without luck.
> >
> > Attached is a log that includes two system freezes, as well as a list of
> > PCI(e) devices created by Debian reportbug.
> > The first freeze happens at "Jan  8 04:26:28" and the second freeze
> > happens at "Jan  8 14:50:16".
> >
> > Currently, I am using git bisect to narrow down the window of possible
> > commits, but since the issue appears seemingly random, it will take many
> > months to identify the offending commit this way.
> >
> > The original Debian bug report is here:
> > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1028309

For some reason the log [1] contains very little of the kernel dmesg
output. The freeze does seem to be partial (I see messages for hundreds
or thousands of seconds after the nvme reset), but the system needs a
reboot to recover.

The lspci information [2] shows the 00:1b.0 Root Port leading to the
01:00.0 NVMe device.

Is it possible to collect lspci output after the nvme freeze? If so,
please save the output of:

  sudo lspci -vv -s00:1b.0
  sudo lspci -vv -s01:00.0

Make sure to run lspci as root so we can see the error logging
registers for these devices.

If you can collect more of the dmesg log after the freeze, e.g., via
the "dmesg" command, that might be helpful, too.
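
If it helps, something like the following could gather all of that in
one go. This is only a rough sketch: the function name, the output file
names, and the SUDO override variable are made up for illustration, and
the device addresses are simply the two from the lspci listing [2].

```shell
#!/bin/sh
# Sketch only: capture PCI and kernel state after the nvme freeze.
# Assumptions (not from the report): the function name, the output file
# names, and the SUDO override variable are illustrative.

# Set SUDO to an empty string when already running as root.
SUDO="${SUDO-sudo}"

collect_nvme_state() {
    outdir=$1
    mkdir -p "$outdir" || return 1

    # Root Port (00:1b.0) and NVMe device (01:00.0), per the lspci listing [2].
    for dev in 00:1b.0 01:00.0; do
        $SUDO lspci -vv -s "$dev" > "$outdir/lspci-$dev.txt" 2>&1
    done

    # Kernel ring buffer, including any messages logged after the reset.
    $SUDO dmesg > "$outdir/dmesg.txt" 2>&1
}
```

Since the report says the file system goes read-only after the failure,
the output directory probably needs to be somewhere that is still
writable, e.g. "collect_nvme_state /dev/shm/nvme-freeze" (assuming
/dev/shm is a tmpfs mount on this system), or the commands could be run
over ssh with the output saved on the remote side.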

Bjorn

[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?att=1;bug=1028309;filename=x299-desktop_crash.log.xz;msg=5
[2] https://bugs.debian.org/cgi-bin/bugreport.cgi?att=0;bug=1028309;msg=5