[5.17.0-rc6][DLPAR][SRIOV/mlx5]EEH errors and WARNING: CPU: 7 PID: 30505 at include/rdma/ib_verbs.h:3688 mlx5_ib_dev_res_cleanup

From: Abdul Haleem
Date: Mon Mar 07 2022 - 09:59:19 EST


Greetings,

HMC DLPAR hotplug of an SRIOV logical device backed by an Everglade Mellanox adapter results in EEH error messages followed by WARNINGs on my PowerPC P10 LPAR running the latest 5.17-rc6 kernel.


From the HMC, DLPAR remove and then add the SRIOV device:
$ chhwres -r sriov -m ltcden11 --rsubtype logport -o r --id 9 -a adapter_id=1,logical_port_id=2700400f
$ chhwres -r sriov -m ltcden11 --rsubtype logport -o a --id 9 -a phys_port_id=0,adapter_id=1,logical_port_id=2700400f,logical_port_type=eth

The above commands completed, but the console is filled with EEH errors and warnings.
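
In case it helps with triage, here is a minimal sketch for pulling the frozen PE and the WARNING sites out of a saved console log. It was not part of the test run that produced this report; the function name `summarize` is hypothetical, and the two regexes simply assume the "EEH: Frozen ... detected" and "WARNING: CPU: ... PID: ... at ..." message formats seen in the console messages of this report.

```python
import re

# Patterns assumed from the message formats in this report's console log.
EEH_FROZEN = re.compile(r"EEH: Frozen (PHB#\S+) detected")
WARNING = re.compile(r"WARNING: CPU: \d+ PID: \d+ at (\S+) (\S+)")


def summarize(log_text):
    """Return (frozen_pes, warnings) found in a console log.

    frozen_pes: list of PHB/PE identifiers reported as frozen by EEH.
    warnings:   list of (source_location, symbol+offset) tuples, one per
                kernel WARNING line.
    """
    frozen, warnings = [], []
    for line in log_text.splitlines():
        m = EEH_FROZEN.search(line)
        if m:
            frozen.append(m.group(1))
            continue
        m = WARNING.search(line)
        if m:
            warnings.append((m.group(1), m.group(2)))
    return frozen, warnings
```

It could be called as `summarize(open("console.log").read())`, where `console.log` is a placeholder for wherever the console output was captured.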

Console messages:
PC: Registered rdma backchannel transport module.
mlx5_core 400f:01:00.0 eth1: Link up
IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
mlx5_core 8005:01:00.0 eth2: Link up
IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver
mlx5_core 400f:01:00.0: poll_health:800:(pid 0): Fatal error 1 detected
EEH: Recovering PHB#400f-PE#10000
EEH: PE location: N/A, PHB location: N/A
mlx5_core 400f:01:00.0: print_health_info:424:(pid 0): PCI slot is unavailable
mlx5_core 400f:01:00.0: mlx5_trigger_health_work:756:(pid 0): new health works are not permitted at this stage
EEH: Frozen PHB#400f-PE#10000 detected
EEH: Call Trace:
EEH: [c000000000054d10] __eeh_send_failure_event+0x70/0x150
EEH: [c00000000004df98] eeh_dev_check_failure+0x2e8/0x6c0
EEH: [c00000000004e438] eeh_check_failure+0xc8/0x100
EEH: [c0000000006a04b4] ioread32be+0x114/0x180
EEH: [c008000000d42bc0] mlx5_health_check_fatal_sensors+0x28/0x180 [mlx5_core]
EEH: [c008000000d43448] poll_health+0x50/0x260 [mlx5_core]
EEH: [c00000000021fed0] call_timer_fn+0x50/0x200
EEH: [c000000000220e90] run_timer_softirq+0x340/0x7c0
EEH: [c000000000c9e85c] __do_softirq+0x15c/0x3d0
EEH: [c00000000014f068] irq_exit+0x168/0x1b0
EEH: [c000000000026f84] timer_interrupt+0x1a4/0x3e0
EEH: [c000000000009a08] decrementer_common_virt+0x208/0x210
EEH: [c00000000367bdc0] 0xc00000000367bdc0
EEH: [c0000000009bf764] dedicated_cede_loop+0x94/0x1a0
EEH: [c0000000009bc094] cpuidle_enter_state+0x2d4/0x4e0
EEH: [c0000000009bc338] cpuidle_enter+0x48/0x70
EEH: [c00000000019ded4] call_cpuidle+0x44/0x80
EEH: [c00000000019e4b0] do_idle+0x340/0x390
EEH: [c00000000019e730] cpu_startup_entry+0x30/0x40
EEH: [c0000000000605a0] start_secondary+0x290/0x2b0
EEH: [c00000000000d154] start_secondary_prolog+0x10/0x14
EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
EEH: Notify device drivers to shutdown
EEH: Beginning: 'error_detected(IO frozen)'
mlx5_core 400f:01:00.0: wait_func_handle_exec_timeout:1108:(pid 30505): cmd[0]: DESTROY_RMP(0x90e) No done completion
mlx5_core 400f:01:00.0: wait_func:1136:(pid 30505): DESTROY_RMP(0x90e) timeout. Will cause a leak of a command resource
------------[ cut here ]------------
Destroy of kernel SRQ shouldn't fail
WARNING: CPU: 7 PID: 30505 at include/rdma/ib_verbs.h:3688 mlx5_ib_dev_res_cleanup+0x104/0x1a0 [mlx5_ib]
Modules linked in: sit tunnel4 ip_tunnel rpadlpar_io rpaphp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag bonding rfkill rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser ib_umad rdma_cm ib_ipoib iw_cm ib_cm libiscsi scsi_transport_iscsi mlx5_ib ib_uverbs ib_core xts pseries_rng vmx_crypto gf128mul sch_fq_codel binfmt_misc ip_tables ext4 mbcache jbd2 dm_service_time mlx5_core sd_mod t10_pi sg ibmvfc scsi_transport_fc ibmveth mlxfw ptp pps_core dm_multipath dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 7 PID: 30505 Comm: drmgr Not tainted 5.17.0-rc6-autotest-g669b258a793d #1
NIP: c0080000023cf20c LR: c0080000023cf208 CTR: c000000000702790
REGS: c0000000111b7420 TRAP: 0700 Not tainted (5.17.0-rc6-autotest-g669b258a793d)
MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 48088224 XER: 00000005
CFAR: c000000000143c90 IRQMASK: 0
GPR00: c0080000023cf208 c0000000111b76c0 c008000002438000 0000000000000024
GPR04: 00000000ffff7fff c0000000111b7390 c0000000111b7388 0000000000000027
GPR08: c0000018fd067e00 0000000000000001 0000000000000027 c0000000027a68f0
GPR12: 0000000000008000 c0000018ff984e80 0000000000000000 0000000119d902a0
GPR16: 00007fffd673e838 0000000119d90ed0 0000000119da3070 0000000106ad1e38
GPR20: 0000000106acf330 0000000106acf3d8 0000000106acd838 0000000119da3208
GPR24: 0000000000000007 0000000000000000 c008000000e78320 c000000002818eb8
GPR28: c00000000fd210d0 c0080000024328a8 c000000017808000 c000000017808000
NIP [c0080000023cf20c] mlx5_ib_dev_res_cleanup+0x104/0x1a0 [mlx5_ib]
LR [c0080000023cf208] mlx5_ib_dev_res_cleanup+0x100/0x1a0 [mlx5_ib]
Call Trace:
[c0000000111b76c0] [c0080000023cf208] mlx5_ib_dev_res_cleanup+0x100/0x1a0 [mlx5_ib] (unreliable)
[c0000000111b7730] [c0080000023d4c00] __mlx5_ib_remove+0x78/0xc0 [mlx5_ib]
[c0000000111b7770] [c00000000082479c] auxiliary_bus_remove+0x3c/0x70
[c0000000111b77a0] [c000000000814278] device_release_driver_internal+0x168/0x2d0
[c0000000111b77e0] [c000000000811748] bus_remove_device+0x118/0x210
[c0000000111b7860] [c000000000809a18] device_del+0x1d8/0x4e0
[c0000000111b7920] [c008000000d601b0] mlx5_rescan_drivers_locked.part.9+0xf8/0x250 [mlx5_core]
[c0000000111b79d0] [c008000000d60870] mlx5_unregister_device+0x48/0x80 [mlx5_core]
[c0000000111b7a00] [c008000000d32930] mlx5_uninit_one+0x38/0x100 [mlx5_core]
[c0000000111b7a70] [c008000000d33330] remove_one+0x58/0xa0 [mlx5_core]
[c0000000111b7aa0] [c000000000736d0c] pci_device_remove+0x5c/0x100
[c0000000111b7ae0] [c000000000814278] device_release_driver_internal+0x168/0x2d0
[c0000000111b7b20] [c000000000728a98] pci_stop_bus_device+0xa8/0x100
[c0000000111b7b60] [c000000000728cdc] pci_stop_and_remove_bus_device_locked+0x2c/0x50
[c0000000111b7b90] [c000000000739d20] remove_store+0xc0/0xe0
[c0000000111b7be0] [c000000000806870] dev_attr_store+0x30/0x50
[c0000000111b7c00] [c0000000005767c0] sysfs_kf_write+0x60/0x80
[c0000000111b7c20] [c000000000574e50] kernfs_fop_write_iter+0x1a0/0x2a0
[c0000000111b7c70] [c00000000045e3ec] new_sync_write+0x14c/0x1d0
[c0000000111b7d10] [c000000000461904] vfs_write+0x234/0x340
[c0000000111b7d60] [c000000000461bc4] ksys_write+0x74/0x130
[c0000000111b7db0] [c00000000002f608] system_call_exception+0x178/0x380
[c0000000111b7e10] [c00000000000c64c] system_call_common+0xec/0x250
--- interrupt: c00 at 0x20000026bd74
NIP: 000020000026bd74 LR: 00002000001e34c4 CTR: 0000000000000000
REGS: c0000000111b7e80 TRAP: 0c00 Not tainted (5.17.0-rc6-autotest-g669b258a793d)
MSR: 800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 24004222 XER: 00000000
IRQMASK: 0
GPR00: 0000000000000004 00007fffd673e650 0000200000367100 0000000000000007
GPR04: 0000000119da3ea0 0000000000000001 fffffffffbad2c84 0000000119d902a0
GPR08: 0000000000000001 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000000000 000020000005b520 0000000000000000 0000000119d902a0
GPR16: 00007fffd673e838 0000000119d90ed0 0000000119da3070 0000000106ad1e38
GPR20: 0000000106acf330 0000000106acf3d8 0000000106acd838 0000000119da3208
GPR24: 0000000119da3219 00007fffd673e878 0000000000000001 0000000119da3ea0
GPR28: 0000000000000001 0000000119d902a0 0000000119da3ea0 0000000000000001
NIP [000020000026bd74] 0x20000026bd74
LR [00002000001e34c4] 0x2000001e34c4
--- interrupt: c00
Instruction dump:
60000000 3d420000 e94a84c8 892a0000 2f890000 409eff64 3c620000 e86384d0
39200001 992a0000 48032a1d e8410018 <0fe00000> 3d420000 e94a84c8 892a0000
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
WARNING: CPU: 7 PID: 30505 at drivers/infiniband/core/verbs.c:347 ib_dealloc_pd_user+0x68/0xd0 [ib_core]
Modules linked in: sit tunnel4 ip_tunnel rpadlpar_io rpaphp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag bonding rfkill rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser ib_umad rdma_cm ib_ipoib iw_cm ib_cm li

--
Regards,

Abdul Haleem
IBM Linux Technology Center