intel i225 NIC loses PCIe link, network becomes unusable)

From: Arno Lehmann
Date: Mon Feb 12 2024 - 05:39:54 EST


Hello everybody,

I'm struggling with the problem named in the subject.

I originally reported this to the Debian bug tracker; you'll find the history here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1060706

Infrequently, and apparently at random, the PCIe link to the NIC is lost, and the network then becomes unusable. Unloading and reloading the igc module (rmmod / modprobe) does not resolve the problem; a reboot is necessary.

I first noticed this when installing the system last year. After a bit of searching, I found that the kernel option 'pcie_aspm=off' was supposed to help, so I set it and have kept it set ever since.

The problem persists.
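
To rule out that the option was simply not applied, the running kernel's command line can be checked. The following is a minimal sketch in Python; the helper names `kernel_param` and `read_cmdline` are made up for illustration, and the parsing is deliberately simple (it splits on whitespace and does not handle quoted values containing spaces).

```python
def kernel_param(cmdline, name):
    """Return the value of `name` on a kernel command line.

    Returns '' for a bare flag without '=', and None if the parameter
    is absent. If the parameter appears several times, this sketch
    keeps the last occurrence.
    """
    value = None
    for token in cmdline.split():
        key, _sep, val = token.partition("=")
        if key == name:
            value = val
    return value


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()
```

On the affected machine, `kernel_param(read_cmdline(), "pcie_aspm")` would be expected to return `"off"` if the option is in effect for the current boot.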

Most recent case is this one:

[So Feb 11 15:47:18 2024] igc 0000:0b:00.0 eno1: NIC Link is Down
[So Feb 11 15:47:21 2024] igc 0000:0b:00.0 eno1: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[So Feb 11 16:52:01 2024] igc 0000:0b:00.0 eno1: NIC Link is Down
[So Feb 11 16:52:05 2024] igc 0000:0b:00.0 eno1: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX

(I have no idea whether the two events above have any relevance.)

[So Feb 11 18:47:59 2024] igc 0000:0b:00.0 eno1: PCIe link lost, device now detached
[So Feb 11 18:47:59 2024] ------------[ cut here ]------------
[So Feb 11 18:47:59 2024] igc: Failed to read reg 0xc030!
[So Feb 11 18:47:59 2024] WARNING: CPU: 20 PID: 136256 at drivers/net/ethernet/intel/igc/igc_main.c:6583 igc_rd32+0x8d/0xa0 [igc]
[So Feb 11 18:47:59 2024] Modules linked in: rfcomm cpufreq_userspace cpufreq_powersave cpufreq_ondemand cpufreq_conservative nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs overlay qrtr cmac algif_hash algif_skcipher af_alg bnep sunrpc binfmt_misc nls_ascii nls_cp437 vfat fat ext4 mbcache jbd2 intel_rapl_msr intel_rapl_common btusb btrtl btbcm btintel btmtk bluetooth mt7921e snd_hda_codec_hdmi mt7921_common mt76_connac_lib edac_mce_amd snd_hda_intel mt76 snd_intel_dspcfg kvm_amd snd_intel_sdw_acpi sha3_generic mac80211 jitterentropy_rng snd_usb_audio uvcvideo snd_hda_codec drbg libarc4 videobuf2_vmalloc snd_usbmidi_lib asus_nb_wmi eeepc_wmi kvm uvc videobuf2_memops snd_rawmidi ansi_cprng snd_hda_core asus_wmi videobuf2_v4l2 snd_seq_device snd_hwdep ecdh_generic irqbypass battery ecc ledtrig_audio videodev snd_pcm sparse_keymap cfg80211 crc16 rapl snd_timer videobuf2_common platform_profile wmi_bmof sp5100_tco pcspkr snd ccp mc watchdog k10temp soundcore rfkill joydev sg evdev msr
[So Feb 11 18:47:59 2024] parport_pc ppdev lp parport fuse loop efi_pstore configfs efivarfs ip_tables x_tables autofs4 xfs libcrc32c crc32c_generic sd_mod dm_crypt dm_mod uas usb_storage hid_generic amdgpu amdxcp drm_buddy gpu_sched usbhid i2c_algo_bit drm_suballoc_helper hid drm_display_helper sr_mod cdrom cec rc_core crc32_pclmul drm_ttm_helper crc32c_intel ghash_clmulni_intel ttm ahci sha512_ssse3 sha512_generic libahci nvme xhci_pci drm_kms_helper libata xhci_hcd nvme_core drm aesni_intel t10_pi usbcore scsi_mod crypto_simd crc64_rocksoft_generic igc cryptd crc64_rocksoft crc_t10dif crct10dif_generic i2c_piix4 crct10dif_pclmul crc64 crct10dif_common scsi_common usb_common video wmi gpio_amdpt gpio_generic button
[So Feb 11 18:47:59 2024] CPU: 20 PID: 136256 Comm: kworker/20:0 Not tainted 6.5.0-0.deb12.4-amd64 #1 Debian 6.5.10-1~bpo12+1
[So Feb 11 18:47:59 2024] Hardware name: ASUS System Product Name/ROG STRIX X670E-A GAMING WIFI, BIOS 1904 01/29/2024
[So Feb 11 18:47:59 2024] Workqueue: events igc_watchdog_task [igc]
[So Feb 11 18:47:59 2024] RIP: 0010:igc_rd32+0x8d/0xa0 [igc]
[So Feb 11 18:47:59 2024] Code: 48 c7 c6 10 76 36 c0 e8 81 6a c1 d5 48 8b bb 28 ff ff ff e8 05 d2 97 d5 84 c0 74 bc 89 ee 48 c7 c7 38 76 36 c0 e8 c3 ee 36 d5 <0f> 0b eb aa b8 ff ff ff ff e9 15 cf e7 d5 0f 1f 44 00 00 90 90 90
[So Feb 11 18:47:59 2024] RSP: 0018:ffffa203cfe8fdd8 EFLAGS: 00010282
[So Feb 11 18:47:59 2024] RAX: 0000000000000000 RBX: ffff961b5c75ccb8 RCX: 0000000000000027
[So Feb 11 18:47:59 2024] RDX: ffff962a5e7213c8 RSI: 0000000000000001 RDI: ffff962a5e7213c0
[So Feb 11 18:47:59 2024] RBP: 000000000000c030 R08: 0000000000000000 R09: ffffa203cfe8fc68
[So Feb 11 18:47:59 2024] R10: 0000000000000003 R11: ffff962a9de3ac28 R12: ffff961b5c75c000
[So Feb 11 18:47:59 2024] R13: 0000000000000000 R14: ffff961b54c92d40 R15: 000000000000c030
[So Feb 11 18:47:59 2024] FS: 0000000000000000(0000) GS:ffff962a5e700000(0000) knlGS:0000000000000000
[So Feb 11 18:47:59 2024] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[So Feb 11 18:47:59 2024] CR2: 00007fb76de93000 CR3: 00000001153d0000 CR4: 0000000000750ee0
[So Feb 11 18:47:59 2024] PKRU: 55555554
[So Feb 11 18:47:59 2024] Call Trace:
[So Feb 11 18:47:59 2024] <TASK>
[So Feb 11 18:47:59 2024] ? igc_rd32+0x8d/0xa0 [igc]
[So Feb 11 18:47:59 2024] ? __warn+0x81/0x130
[So Feb 11 18:47:59 2024] ? igc_rd32+0x8d/0xa0 [igc]
[So Feb 11 18:47:59 2024] ? report_bug+0x171/0x1a0
[So Feb 11 18:47:59 2024] ? srso_alias_return_thunk+0x5/0x7f
[So Feb 11 18:47:59 2024] ? prb_read_valid+0x1b/0x30
[So Feb 11 18:47:59 2024] ? handle_bug+0x41/0x70
[So Feb 11 18:47:59 2024] ? exc_invalid_op+0x17/0x70
[So Feb 11 18:47:59 2024] ? asm_exc_invalid_op+0x1a/0x20
[So Feb 11 18:47:59 2024] ? igc_rd32+0x8d/0xa0 [igc]
[So Feb 11 18:47:59 2024] ? igc_rd32+0x8d/0xa0 [igc]
[So Feb 11 18:47:59 2024] igc_update_stats+0x8a/0x6d0 [igc]
[So Feb 11 18:47:59 2024] igc_watchdog_task+0x9d/0x4a0 [igc]
[So Feb 11 18:47:59 2024] process_one_work+0x1df/0x3e0
[So Feb 11 18:47:59 2024] worker_thread+0x51/0x390
[So Feb 11 18:47:59 2024] ? __pfx_worker_thread+0x10/0x10
[So Feb 11 18:47:59 2024] kthread+0xe5/0x120
[So Feb 11 18:47:59 2024] ? __pfx_kthread+0x10/0x10
[So Feb 11 18:47:59 2024] ret_from_fork+0x31/0x50
[So Feb 11 18:47:59 2024] ? __pfx_kthread+0x10/0x10
[So Feb 11 18:47:59 2024] ret_from_fork_asm+0x1b/0x30
[So Feb 11 18:47:59 2024] </TASK>
[So Feb 11 18:47:59 2024] ---[ end trace 0000000000000000 ]---


With guidance from the friendly folks at the Debian bug tracker, we found that this happens with many kernel versions, as can be derived from the following condensed list:

# journalctl --grep '(Linux version|PCIe link lost)' --quiet | cat
Aug 30 18:16:18 Zwerg kernel: Linux version 6.1.0-11-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.38-4 (2023-08-08)
Sep 20 14:21:17 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, device now detached
Sep 20 19:47:06 Zwerg kernel: Linux version 6.1.0-11-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.38-4 (2023-08-08)
Okt 04 17:16:08 Zwerg kernel: Linux version 6.1.0-12-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.52-1 (2023-09-07)
Okt 06 05:44:20 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, device now detached
Okt 07 16:39:10 Zwerg kernel: igc 0000:0a:00.0 (unnamed net_device) (uninitialized): PCIe link lost, device now detached
Okt 07 16:43:41 Zwerg kernel: Linux version 6.1.0-12-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.52-1 (2023-09-07)
Okt 23 18:23:54 Zwerg kernel: Linux version 6.1.0-12-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.52-1 (2023-09-07)
Okt 23 18:31:25 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, device now detached
Okt 23 18:48:58 Zwerg kernel: Linux version 6.1.0-13-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29)
Okt 30 11:16:06 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, device now detached
Okt 31 13:50:06 Zwerg kernel: igc 0000:0a:00.0 (unnamed net_device) (uninitialized): PCIe link lost, device now detached
Okt 31 13:52:01 Zwerg kernel: Linux version 6.1.0-13-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29)
Nov 22 18:59:11 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, device now detached
Nov 23 12:18:19 Zwerg kernel: Linux version 6.1.0-13-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29)
Nov 23 15:45:49 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, device now detached
Nov 23 15:52:51 Zwerg kernel: Linux version 6.1.0-13-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29)
Dez 06 19:06:18 Zwerg kernel: Linux version 6.1.0-13-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29)
Dez 09 15:12:13 Zwerg kernel: Linux version 6.1.0-14-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.64-1 (2023-11-30)
Dez 19 07:33:02 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, device now detached
Dez 20 10:29:21 Zwerg kernel: Linux version 6.1.0-15-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.66-1 (2023-12-09)
Jan 01 09:57:40 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, device now detached
Jan 02 13:41:33 Zwerg kernel: Linux version 6.1.0-15-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.66-1 (2023-12-09)
Jan 10 16:15:20 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, device now detached
Jan 13 11:02:41 Zwerg kernel: Linux version 6.1.0-17-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30)
Jan 13 11:16:31 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, device now detached
Jan 13 11:18:13 Zwerg kernel: Linux version 6.1.0-17-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30)
Jan 19 14:25:08 Zwerg kernel: Linux version 6.1.0-1-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-13) 12.2.0, GNU ld (GNU Binutils for Debian) 2.39.90.20221231) #1 SMP PREEMPT_DYNAMIC Debian 6.1.4-1 (2023-01-07)
Jan 27 09:41:16 Zwerg kernel: Linux version 6.1.0-17-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30)
Jan 27 09:44:53 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, device now detached
Jan 27 09:48:05 Zwerg kernel: igc 0000:0a:00.0 (unnamed net_device) (uninitialized): PCIe link lost, device now detached
Jan 27 09:52:16 Zwerg kernel: igc 0000:0a:00.0 (unnamed net_device) (uninitialized): PCIe link lost, device now detached
Jan 27 09:58:46 Zwerg kernel: Linux version 6.1.0-1-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-13) 12.2.0, GNU ld (GNU Binutils for Debian) 2.39.90.20221231) #1 SMP PREEMPT_DYNAMIC Debian 6.1.4-1 (2023-01-07)
Feb 01 04:19:17 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, device now detached
Feb 01 14:43:03 Zwerg kernel: igc 0000:0a:00.0 (unnamed net_device) (uninitialized): PCIe link lost, device now detached
Feb 01 14:50:04 Zwerg kernel: Linux version 6.1.0-17-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30)
Feb 01 15:28:42 Zwerg kernel: Linux version 6.5.0-0.deb12.4-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.5.10-1~bpo12+1 (2023-11-23)
Feb 08 18:26:31 Zwerg kernel: Linux version 6.5.0-0.deb12.4-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.5.10-1~bpo12+1 (2023-11-23)
Feb 08 18:33:38 Zwerg kernel: igc 0000:0a:00.0 eno1: PCIe link lost, device now detached
Feb 08 18:58:25 Zwerg kernel: Linux version 6.5.0-0.deb12.4-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.5.10-1~bpo12+1 (2023-11-23)
Feb 08 19:00:32 Zwerg kernel: igc 0000:0b:00.0 eno1: PCIe link lost, device now detached
Feb 08 19:02:38 Zwerg kernel: igc 0000:0b:00.0 (unnamed net_device) (uninitialized): PCIe link lost, device now detached
Feb 08 19:05:30 Zwerg kernel: Linux version 6.5.0-0.deb12.4-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.5.10-1~bpo12+1 (2023-11-23)
Feb 09 13:25:08 Zwerg kernel: igc 0000:0b:00.0 eno1: PCIe link lost, device now detached
Feb 09 13:27:17 Zwerg kernel: igc 0000:0b:00.0 (unnamed net_device) (uninitialized): PCIe link lost, device now detached
Feb 09 13:30:42 Zwerg kernel: Linux version 6.5.0-0.deb12.4-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.5.10-1~bpo12+1 (2023-11-23)
Feb 11 18:47:57 Zwerg kernel: igc 0000:0b:00.0 eno1: PCIe link lost, device now detached
Feb 12 10:55:30 Zwerg kernel: Linux version 6.1.0-17-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30)

The kernel versions I used were:

Debian 6.1.4-1 (2023-01-07)
Debian 6.1.38-4 (2023-08-08)
Debian 6.1.52-1 (2023-09-07)
Debian 6.1.55-1 (2023-09-29)
Debian 6.1.64-1 (2023-11-30)
Debian 6.1.66-1 (2023-12-09)
Debian 6.1.69-1 (2023-12-30)
Debian 6.5.10-1~bpo12+1 (2023-11-23)
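
The attribution of each "PCIe link lost" message to the kernel that was running at the time can be sketched in Python as follows. This is an illustration, not the procedure actually used: the function name `count_link_losses` and both regular expressions are assumptions based on the log lines shown above, and each event is simply assigned to the most recent preceding "Linux version" line.

```python
import re

# Debian package version at the end of a boot line,
# e.g. "... Debian 6.1.69-1 (2023-12-30)".
BOOT_RE = re.compile(r"Linux version .* Debian (\S+) \(\d{4}-\d{2}-\d{2}\)\s*$")
LOST_RE = re.compile(r"PCIe link lost, device now detached")


def count_link_losses(lines):
    """Count 'PCIe link lost' messages per kernel version.

    Each message is attributed to the most recent preceding
    'Linux version' boot line. Kernels that booted without any
    such message are listed with a count of 0.
    """
    counts = {}
    current = None  # kernel version of the boot we are currently in
    for line in lines:
        boot = BOOT_RE.search(line)
        if boot:
            current = boot.group(1)
            counts.setdefault(current, 0)
        elif LOST_RE.search(line):
            key = current if current is not None else "unknown"
            counts[key] = counts.get(key, 0) + 1
    return counts
```

The function takes any iterable of lines, so the output of the journalctl command above could be piped into a script that calls `count_link_losses(sys.stdin)`. Note that the counts include the repeated "(uninitialized)" messages, which may stem from module reload attempts within the same boot.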


At this point, it looks like at least one person with a bit of insight is convinced this is an upstream issue.

Of course I'll try to provide whatever other information may be needed.

Most important, I think, is the hardware surrounding the NIC:
This is an ASUSTeK COMPUTER INC. ROG STRIX X670E-A GAMING WIFI, i.e. an AMD X670 chipset, with a freshly updated BIOS: 1904 01/29/2024. The CPU is an AMD Ryzen 9 7900X.

I have not set any particular overclocking or performance options; I just tried to keep all firmware settings on "conservative".


Mass storage is a Western Digital SN850X NVMe device.

I have experienced two cases where the storage device apparently "vanished" from the PCIe bus, which resulted in a flood of journald messages saying that it could not log anything to persistent storage. I have never seen the first few lines of those occurrences and, obviously, I have no logs.

I did notice, however, that the system still responded to pings on the network.

All of this suggests to me that the problem might be related to PCIe power management, though I suspect that my gut feeling is not the best basis for deciding how to proceed here.
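
As a starting point for checking the power-management suspicion, the current ASPM policy and the runtime power-management state of the NIC can be read from sysfs. The Python sketch below assumes that the kernel exposes `/sys/module/pcie_aspm/parameters/policy` (with the active policy in square brackets) and the per-device `power/control` and `power/runtime_status` attributes; the function names are invented for illustration, and the default device address is the one from the most recent log above.

```python
import re
from pathlib import Path


def active_policy(policy_text):
    """Return the bracketed (active) entry of the ASPM policy string,
    e.g. 'powersave' from 'default performance [powersave] powersupersave'.
    Returns None if no entry is bracketed."""
    match = re.search(r"\[(\w+)\]", policy_text)
    return match.group(1) if match else None


def read_or_none(path):
    """Read a sysfs attribute; return None if it is missing or unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def pcie_power_summary(bdf="0000:0b:00.0"):
    """Collect ASPM policy and runtime PM state for one PCI device."""
    policy = read_or_none("/sys/module/pcie_aspm/parameters/policy")
    dev = f"/sys/bus/pci/devices/{bdf}/power"
    return {
        "aspm_policy": active_policy(policy) if policy else None,
        "runtime_pm_control": read_or_none(f"{dev}/control"),
        "runtime_pm_status": read_or_none(f"{dev}/runtime_status"),
    }
```

Calling `pcie_power_summary()` once after boot and again after a link loss, if the device is still listed at that point, would show whether any of these values differ between the two states.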

So, if you see any way to improve this situation and make the system reliably usable, I'm willing to help in any way I can, but you'll have to tell me what to do!

Cheers,

Arno

--
Arno Lehmann

IT-Service Lehmann
Sandstr. 6, 49080 Osnabrück