Re: RX CRC errors on I219-V (6) 8086:15be

From: Kai-Heng Feng
Date: Wed Jul 03 2019 - 07:33:07 EST


at 02:01, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:

On Tue, Jul 02, 2019 at 04:25:59PM +0800, Kai Heng Feng wrote:
+linux-pci

Hi Sasha,

at 6:49 PM, Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx> wrote:

at 14:26, Neftin, Sasha <sasha.neftin@xxxxxxxxx> wrote:

On 6/26/2019 09:14, Kai Heng Feng wrote:
Hi Sasha
at 5:09 PM, Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx> wrote:
Hi Jeffrey,

We've encountered another issue, which causes multiple CRC
errors and renders ethernet completely useless, here's the
network stats:
I also tried ignore_ltr for this issue, seems like it alleviates
the symptom a bit for a while, then the network still becomes
useless after some usage.
And yes, it's also a Whiskey Lake platform. What's the next step
to debug this problem?
Kai-Heng
CRC errors not related to the LTR. Please, try to disable the ME on
your platform. Hope you have this option in BIOS. Another way is to
contact your PC vendor and ask to provide NVM without ME. Let's
start debugging with these steps.

According to ODM, the ME can be physically disabled by a jumper.
But after disabling the ME the same issue can still be observed.

We've found that this issue doesn't happen with a SATA SSD; it only happens when
an NVMe SSD is in use.

Here are the steps:
- Disable NVMe ASPM, issue persists
- modprobe -r e1000e && modprobe e1000e, issue doesn't happen
- Enabling NVMe ASPM, issue doesn't happen

As long as NVMe ASPM gets enabled after e1000e gets loaded, the issue
doesn't happen.

IIUC the problem happens with the mainline and dev-queue e1000e
driver, but not with the out-of-tree Intel driver. Since there is a
working driver and there's the potential (at least in principle) for
unifying them or bisecting between them, I have limited interest in
debugging it from scratch.

I wonder why disabling ASPM on a device solves another device's issue?
The issue may just get papered over by the "working" driver. I'd like to understand the root cause behind this symptom.


If it turns out to be a PCI core problem, I would want to know: What's
the PCI topology? "lspci -vv" output for the system? Does it make a
difference if you boot with "pcie_aspm=off"? Collect complete dmesg,
maybe attach it to a kernel.org bugzilla?

Parameter "pcie_aspm=off" doesn't work for the system.
I need to use "pcie_aspm=force" and change the policy to "performance".
The issue is gone once e1000e loads after ASPM is disabled, either globally or only disabling ASPM on NVMe.
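
For anyone reproducing this, the active global ASPM policy can be read back before e1000e is loaded. Below is a minimal sketch, assuming the usual /sys/module/pcie_aspm/parameters/policy file, in which the active policy is shown in square brackets. The helper name active_aspm_policy is made up for illustration and is not part of any kernel tooling.

```python
import re
from pathlib import Path

# Usual sysfs location of the global ASPM policy (assumed present on the test system).
POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(text):
    """Return the bracketed (active) policy name, or None if nothing is bracketed."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


if __name__ == "__main__":
    if POLICY_PATH.exists():
        print(active_aspm_policy(POLICY_PATH.read_text()))
    else:
        print("pcie_aspm policy file not found")
```

This only reports the global policy. It says nothing about the per-device ASPM state of the NVMe drive, which still has to be checked in the "lspci -vv" output.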

Files attached to https://bugzilla.kernel.org/show_bug.cgi?id=204057

Kai-Heng


/sys/class/net/eno1/statistics$ grep . *
collisions:0
multicast:95
rx_bytes:1499851
rx_compressed:0
rx_crc_errors:1165
rx_dropped:0
rx_errors:2330
rx_fifo_errors:0
rx_frame_errors:0
rx_length_errors:0
rx_missed_errors:0
rx_nohandler:0
rx_over_errors:0
rx_packets:4789
tx_aborted_errors:0
tx_bytes:864312
tx_carrier_errors:0
tx_compressed:0
tx_dropped:0
tx_errors:0
tx_fifo_errors:0
tx_heartbeat_errors:0
tx_packets:7370
tx_window_errors:0
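
To put the counters above in proportion, here is a small sketch that parses the "name:value" lines printed by "grep . *" and computes RX CRC errors per received packet. The function names parse_stats and crc_error_ratio are hypothetical, and whether rx_packets already includes the errored frames depends on the driver, so treat the ratio as a rough indicator only.

```python
def parse_stats(text):
    """Parse 'name:value' lines (as printed by `grep . *`) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep and value.strip().isdigit():
            stats[name] = int(value)
    return stats


def crc_error_ratio(stats):
    """RX CRC errors per received packet; None if no packets were counted."""
    packets = stats.get("rx_packets", 0)
    if packets == 0:
        return None
    return stats.get("rx_crc_errors", 0) / packets


# Three of the counters from the listing above.
sample = """rx_crc_errors:1165
rx_errors:2330
rx_packets:4789"""

print(f"{crc_error_ratio(parse_stats(sample)):.3f}")  # prints 0.243
```

With the numbers from this report that is roughly one CRC error for every four received packets.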

Same behavior can be observed on both mainline kernel and on
your dev-queue branch.
OTOH, the same issue can't be observed on out-of-tree e1000e.

Is there any plan to close the gap between upstream and
out-of-tree version?

Kai-Heng