Mellanox interrupts are not load balanced

From: katiyar26@xxxxxxxxxxx
Date: Fri Jan 27 2023 - 05:00:39 EST


Hi,
I am running a CentOS 7.7 VM in Azure with the Mellanox (mlx5_core) driver for the NIC. It runs a customized 3.10.0-1062.18.1.el7 kernel image with some minor changes in the net directory.

The driver has created as many queues and IRQs as there are CPUs in the VM, but all of the interrupts are processed by CPU0 only. The irqbalance service is running, and smp_affinity is set differently for different IRQs. I also tried setting smp_affinity manually after stopping irqbalance, but all of the interrupts were still delivered to CPU0, as the /proc/interrupts output below shows.
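
For reference, each /proc/irq/<N>/smp_affinity file holds a hexadecimal CPU bitmask, split into comma-separated 32-bit groups on systems with many CPUs. The following is a minimal Python sketch for decoding such masks when checking what is configured for a given IRQ. The names mask_to_cpus and read_irq_affinity and the proc_root parameter are illustrative only. effective_affinity is read only if the kernel exposes it, which may not be the case on a 3.10-based kernel.

```python
from pathlib import Path
from typing import Dict, List


def mask_to_cpus(mask: str) -> List[int]:
    """Decode a hex CPU bitmask such as 'ff' or '00000001,00000000'
    into a sorted list of CPU indices."""
    # Commas only separate 32-bit groups; drop them before parsing.
    value = int(mask.strip().replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


def read_irq_affinity(irq: int, proc_root: str = "/proc") -> Dict[str, List[int]]:
    """Return the decoded affinity masks found for one IRQ.

    proc_root is a hypothetical parameter added so the function can be
    pointed at a copy of the procfs tree.
    """
    irq_dir = Path(proc_root) / "irq" / str(irq)
    result = {}
    # effective_affinity is assumed to be optional; skip it when absent.
    for name in ("smp_affinity", "effective_affinity"):
        path = irq_dir / name
        if path.exists():
            result[name] = mask_to_cpus(path.read_text())
    return result
```

For example, read_irq_affinity(28) would return the CPUs configured for IRQ 28. Note that smp_affinity reflects the requested mask, not necessarily the CPU that ends up handling the interrupt.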

> cat /proc/interrupts
       CPU0   CPU1   CPU2   CPU3   CPU4   CPU5   CPU6   CPU7
  0:   9881      0      0      0      0      0      0      0  IO-APIC-edge     timer
  1:      0      0      0      0      0      0      0      9  IO-APIC-edge     i8042
  3:     21     25     13     19      2      2      3    856  IO-APIC-edge
  4:     68      6     25     22     21     10     19    360  IO-APIC-edge     serial
  8:      0      0      0      0      0      0      0      0  IO-APIC-edge     rtc0
  9:      0      0      0      0      0      0      0      0  IO-APIC-fasteoi  acpi
 12:      0      0      0      0      0      0      0      5  IO-APIC-edge     i8042
 14:    602    318    226    232    278    205     69   8917  IO-APIC-edge     ata_piix
 15:      0      0      0      0      0      0      0      0  IO-APIC-edge     ata_piix
 24:      0      0      0      0      0      0      0      0  PCI-MSI-edge     mlx5_pages_eq@pci:8b76:00:02.0
 25:  19694      0      0      0      0      0      0      0  PCI-MSI-edge     mlx5_cmd_eq@pci:8b76:00:02.0
 26:      0      0      0      0      0      0      0      0  PCI-MSI-edge     mlx5_async_eq@pci:8b76:00:02.0
 28: 123648      0      0      0      0      0      0      0  PCI-MSI-edge     mlx5_comp0@pci:8b76:00:02.0
 29: 152455      0      0      0      0      0      0      0  PCI-MSI-edge     mlx5_comp1@pci:8b76:00:02.0
 30: 102308      0      0      0      0      0      0      0  PCI-MSI-edge     mlx5_comp2@pci:8b76:00:02.0
 31:  89403      0      0      0      0      0      0      0  PCI-MSI-edge     mlx5_comp3@pci:8b76:00:02.0
 32:  86793      0      0      0      0      0      0      0  PCI-MSI-edge     mlx5_comp4@pci:8b76:00:02.0
 33: 107817      0      0      0      0      0      0      0  PCI-MSI-edge     mlx5_comp5@pci:8b76:00:02.0
 34: 117091      0      0      0      0      0      0      0  PCI-MSI-edge     mlx5_comp6@pci:8b76:00:02.0
 35:  59714      0      0      0      0      0      0      0  PCI-MSI-edge     mlx5_comp7@pci:8b76:00:02.0
 36:      0      0      0      0      0      0      0      0  PCI-MSI-edge     mlx5_pages_eq@pci:83a4:00:02.0
 37:  12427      0      0      0      0      0      0      0  PCI-MSI-edge     mlx5_cmd_eq@pci:83a4:00:02.0
 38:      0      0      0      0      0      0      0      0  PCI-MSI-edge     mlx5_async_eq@pci:83a4:00:02.0
 40:  35520      0      0      0      0      0      0      0  PCI-MSI-edge     mlx5_comp0@pci:83a4:00:02.0
 41:    576      0      0      0      0      0      0      0  PCI-MSI-edge     mlx5_comp1@pci:83a4:00:02.0
 42:  34139      0      0      0      0      0      0      0  PCI-MSI-edge     mlx5_comp2@pci:83a4:00:02.0
 43:  19951      0      0      0      0      0      0      0  PCI-MSI-edge     mlx5_comp3@pci:83a4:00:02.0
 44:  41038      0      0      0      0      0      0      0  PCI-MSI-edge     mlx5_comp4@pci:83a4:00:02.0
 45:  36569      0      0      0      0      0      0      0  PCI-MSI-edge     mlx5_comp5@pci:83a4:00:02.0
 46:  42023      0      0      0      0      0      0      0  PCI-MSI-edge     mlx5_comp6@pci:83a4:00:02.0
 47:  12610      0      0      0      0      0      0      0  PCI-MSI-edge     mlx5_comp7@pci:83a4:00:02.0
NMI:      0      0      0      0      0      0      0      0  Non-maskable interrupts
LOC:   1536   1224   1240   1107   1299   1379   1171   2152  Local timer interrupts
SPU:      0      0      0      0      0      0      0      0  Spurious interrupts
PMI:      0      0      0      0      0      0      0      0  Performance monitoring interrupts
IWI:    726    143    776    309    780    370    748   1047  IRQ work interrupts
RTR:      0      0      0      0      0      0      0      0  APIC ICR read retries
RES:  59746  34162 150579  45146 149421  87954 149095  47137  Rescheduling interrupts
CAL:   2562   2717   2601   2590   2577   2649   2572   2557  Function call interrupts
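
To quantify the imbalance, the per-CPU counts can be summed over the mlx5 completion IRQs. Below is a minimal Python sketch, assuming the standard /proc/interrupts layout shown above (a header row of CPU names, then one row per IRQ). parse_interrupts, cpu_totals, and SAMPLE are illustrative names, and SAMPLE copies three rows from the output above.

```python
from typing import List, Tuple

# Three rows copied from the output above, used only as sample input.
SAMPLE = """\
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
28: 123648 0 0 0 0 0 0 0 PCI-MSI-edge mlx5_comp0@pci:8b76:00:02.0
29: 152455 0 0 0 0 0 0 0 PCI-MSI-edge mlx5_comp1@pci:8b76:00:02.0
LOC: 1536 1224 1240 1107 1299 1379 1171 2152 Local timer interrupts
"""


def parse_interrupts(text: str) -> List[Tuple[str, List[int], str]]:
    """Parse /proc/interrupts text into (label, per-CPU counts, description)."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    rows = []
    for line in lines[1:]:
        # The IRQ label ends at the first colon; device names may contain more.
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = fields[:ncpus]
        # Skip rows that do not carry one count per CPU (e.g. ERR, MIS).
        if len(counts) < ncpus or not all(f.isdigit() for f in counts):
            continue
        rows.append((label.strip(), [int(f) for f in counts],
                     " ".join(fields[ncpus:])))
    return rows


def cpu_totals(rows: List[Tuple[str, List[int], str]],
               name_filter: str = "mlx5_comp") -> List[int]:
    """Sum the per-CPU counts of every row whose description contains name_filter."""
    matching = [counts for _, counts, desc in rows if name_filter in desc]
    return [sum(column) for column in zip(*matching)]
```

On a live system the input would come from reading /proc/interrupts instead of SAMPLE. A result with a single non-zero entry means every matching interrupt was counted on one CPU.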

The Mellanox driver version is:
version: 5.0-0
license: Dual BSD/GPL
description: Mellanox 5th generation network adapters (ConnectX series) core driver
author: Eli Cohen <eli@xxxxxxxxxxxx>
rhelversion: 7.7
srcversion: 7D9FFD656B0EB1000804CB2

The same kernel works fine with a different NIC driver (in AWS) and with the igb driver on a physical server.
I also tried the CentOS 7.9 image (3.10.0-1160.76.1.el7) available in the Azure Marketplace, and the issue does not occur there.

Please help me debug or resolve this issue.

regards,
Nitin