RE: Problems with ixgbe driver

From: Holger Kiehl
Date: Mon Jun 17 2013 - 05:11:52 EST


Hello,

First, thank you for the quick help!

On Fri, 14 Jun 2013, Tantilov, Emil S wrote:

-----Original Message-----
From: netdev-owner@xxxxxxxxxxxxxxx [mailto:netdev-owner@xxxxxxxxxxxxxxx] On
Behalf Of Holger Kiehl
Sent: Friday, June 14, 2013 4:50 AM
To: e1000-devel@xxxxxxxxxxxx
Cc: linux-kernel; netdev@xxxxxxxxxxxxxxx
Subject: Problems with ixgbe driver

Hello,

I have a dual-port 10Gb Intel network card in a 2-socket system (Xeon X5690)
with a total of 12 cores. Hyperthreading is enabled, so there are 24 logical
CPUs. The problem I have is that when other systems send large amounts of data,
the network with the Intel ixgbe driver gets very slow. Ping times go up
from 0.2ms to approx. 60ms. Some FTP connections stall for more than 2
minutes. What is strange is that heartbeat is configured on the system
with a serial connection to another node and the kernel always reports

If the network slows down so much there should be some indication in dmesg. Like Tx hangs perhaps.
Can you provide the output of dmesg and ethtool -S from the offending interface after the issue occurs?

No, there is absolutely no indication in dmesg or /var/log/messages. But here
is the ethtool output when ping times go up:

root@helena:~# ethtool -S eth6
NIC statistics:
rx_packets: 4410779
tx_packets: 8902514
rx_bytes: 2014041824
tx_bytes: 13199913202
rx_errors: 0
tx_errors: 0
rx_dropped: 0
tx_dropped: 0
multicast: 4245
collisions: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 0
rx_missed_errors: 28143
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
rx_pkts_nic: 2401276937
tx_pkts_nic: 3868619482
rx_bytes_nic: 868282794731
tx_bytes_nic: 5743382228649
lsc_int: 4
tx_busy: 0
non_eop_descs: 743957
broadcast: 1745556
rx_no_buffer_count: 0
tx_timeout_count: 0
tx_restart_queue: 425
rx_long_length_errors: 0
rx_short_length_errors: 0
tx_flow_control_xon: 171
rx_flow_control_xon: 0
tx_flow_control_xoff: 277
rx_flow_control_xoff: 0
rx_csum_offload_errors: 0
alloc_rx_page_failed: 0
alloc_rx_buff_failed: 0
lro_aggregated: 0
lro_flushed: 0
rx_no_dma_resources: 0
hw_rsc_aggregated: 1153374
hw_rsc_flushed: 129169
fdir_match: 2424508153
fdir_miss: 1706029
fdir_overflow: 33
os2bmc_rx_by_bmc: 0
os2bmc_tx_by_bmc: 0
os2bmc_tx_by_host: 0
os2bmc_rx_by_host: 0
tx_queue_0_packets: 470182
tx_queue_0_bytes: 690123121
tx_queue_1_packets: 797784
tx_queue_1_bytes: 1203968369
tx_queue_2_packets: 648692
tx_queue_2_bytes: 950171718
tx_queue_3_packets: 647434
tx_queue_3_bytes: 948647518
tx_queue_4_packets: 263216
tx_queue_4_bytes: 394806409
tx_queue_5_packets: 426786
tx_queue_5_bytes: 629387628
tx_queue_6_packets: 253708
tx_queue_6_bytes: 371774276
tx_queue_7_packets: 544634
tx_queue_7_bytes: 812223169
tx_queue_8_packets: 279056
tx_queue_8_bytes: 407792510
tx_queue_9_packets: 735792
tx_queue_9_bytes: 1092693961
tx_queue_10_packets: 393576
tx_queue_10_bytes: 583283986
tx_queue_11_packets: 712565
tx_queue_11_bytes: 1037740789
tx_queue_12_packets: 264445
tx_queue_12_bytes: 386010613
tx_queue_13_packets: 246828
tx_queue_13_bytes: 370387352
tx_queue_14_packets: 191789
tx_queue_14_bytes: 281160607
tx_queue_15_packets: 384581
tx_queue_15_bytes: 579890782
tx_queue_16_packets: 175119
tx_queue_16_bytes: 261312970
tx_queue_17_packets: 151219
tx_queue_17_bytes: 220259675
tx_queue_18_packets: 467746
tx_queue_18_bytes: 707472612
tx_queue_19_packets: 30642
tx_queue_19_bytes: 44896997
tx_queue_20_packets: 157957
tx_queue_20_bytes: 238772784
tx_queue_21_packets: 287819
tx_queue_21_bytes: 434965075
tx_queue_22_packets: 269298
tx_queue_22_bytes: 407637986
tx_queue_23_packets: 102344
tx_queue_23_bytes: 145542751
rx_queue_0_packets: 219438
rx_queue_0_bytes: 273936020
rx_queue_1_packets: 398269
rx_queue_1_bytes: 52080243
rx_queue_2_packets: 285870
rx_queue_2_bytes: 102299543
rx_queue_3_packets: 347238
rx_queue_3_bytes: 145830086
rx_queue_4_packets: 118448
rx_queue_4_bytes: 17515218
rx_queue_5_packets: 228029
rx_queue_5_bytes: 114142681
rx_queue_6_packets: 94285
rx_queue_6_bytes: 107618165
rx_queue_7_packets: 289615
rx_queue_7_bytes: 168428647
rx_queue_8_packets: 109288
rx_queue_8_bytes: 35178080
rx_queue_9_packets: 393061
rx_queue_9_bytes: 377122152
rx_queue_10_packets: 155004
rx_queue_10_bytes: 66560302
rx_queue_11_packets: 381580
rx_queue_11_bytes: 182550920
rx_queue_12_packets: 140681
rx_queue_12_bytes: 44514373
rx_queue_13_packets: 127091
rx_queue_13_bytes: 18524907
rx_queue_14_packets: 92548
rx_queue_14_bytes: 34725166
rx_queue_15_packets: 199612
rx_queue_15_bytes: 66689821
rx_queue_16_packets: 90018
rx_queue_16_bytes: 29206483
rx_queue_17_packets: 81277
rx_queue_17_bytes: 55206035
rx_queue_18_packets: 224446
rx_queue_18_bytes: 14869858
rx_queue_19_packets: 16975
rx_queue_19_bytes: 48400959
rx_queue_20_packets: 80806
rx_queue_20_bytes: 5398100
rx_queue_21_packets: 146815
rx_queue_21_bytes: 9796087
rx_queue_22_packets: 136018
rx_queue_22_bytes: 9023369
rx_queue_23_packets: 54781
rx_queue_23_bytes: 34724433

This was with the 3.15.1 driver and with the combined queue count set to 24
via ethtool, as you suggested below.
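
A dump of this size is easier to compare between runs once it is reduced to a few numbers. The following is a minimal Python sketch, not part of the original message, that parses `ethtool -S` text, reports which of a hand-picked set of counters are nonzero, and finds the busiest and quietest receive queues. The watch list and all function names are illustrative assumptions, and the sample is only a short excerpt of the output above.

```python
import re

# Hypothetical watch list: counters from the dump above that look worth
# tracking. This selection is an assumption, not a driver recommendation.
WATCH = (
    "rx_missed_errors",
    "rx_no_buffer_count",
    "rx_no_dma_resources",
    "tx_restart_queue",
    "fdir_overflow",
    "tx_timeout_count",
)


def parse_ethtool_stats(text):
    """Parse 'name: value' lines from `ethtool -S` output into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        value = value.strip()
        # Skip the "NIC statistics:" header and anything non-numeric.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def queue_packets(stats, direction):
    """Return {queue_index: packets} for direction 'rx' or 'tx'."""
    pattern = re.compile(rf"^{direction}_queue_(\d+)_packets$")
    result = {}
    for name, value in stats.items():
        match = pattern.match(name)
        if match:
            result[int(match.group(1))] = value
    return result


def nonzero(stats, names):
    """Return the subset of the named counters that are present and > 0."""
    return {n: stats[n] for n in names if stats.get(n, 0) > 0}


# Short excerpt of the eth6 output quoted above.
sample = """\
NIC statistics:
rx_missed_errors: 28143
tx_restart_queue: 425
fdir_overflow: 33
rx_no_buffer_count: 0
rx_queue_0_packets: 219438
rx_queue_1_packets: 398269
rx_queue_19_packets: 16975
"""

stats = parse_ethtool_stats(sample)
rx = queue_packets(stats, "rx")
busiest = max(rx, key=rx.get)
quietest = min(rx, key=rx.get)

print("nonzero watched counters:", nonzero(stats, WATCH))
print("busiest rx queue:", busiest, "quietest rx queue:", quietest)
```

Since these counters accumulate over time, taking one snapshot before and one during the slowdown and subtracting them would probably be more telling than a single dump.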


ttyS0: 4 input overrun(s)

when a lot of data is sent and the ping time goes up.

On the network there are three VLANs configured. The network is bonded
(active-backup) together with another HP NC523SFP 10Gb 2-port Server
Adapter. When I switch the network to this card the problem goes away.
Also the ttyS0 input overruns disappear. Note also both network cards
are connected to the same switch.

The system uses Scientific Linux 6.4 with kernel.org kernel. I noticed
this behavior with kernel 3.9.5 and 3.9.6-rc1. Before I did not notice
it because traffic always went over the HP NC523SFP qlcnic card.

In searching for a solution to the problem I found a newer ixgbe driver,
3.15.1 (3.9.6-rc1 has 3.11.33-k), and tried that. But it has the same
problem. However when I load the module as follows:

modprobe ixgbe RSS=8,8

the problem goes away. The kernel.org ixgbe driver does not offer this
option. Why? It seems that both drivers have problems on systems with

If you are using a newer kernel and ethtool version you can use `ethtool -L ethX combined Y` to control the number of queues per interface.

Okay, thank you! I did not know this.

24 CPUs. But I cannot believe that I am the only one who noticed this,
since ixgbe is widely used.

We run traffic with multiple queues all the time and I don't think what you are reporting is a generic issue. Most likely it's something related to your setup/system.

Yes, I think so too. But what could it be? Please just ask for any other
information I could provide. As I already mentioned, the ixgbe card is
bonded with a QLogic NIC and I have two (not three) VLANs configured over
this bond. Maybe the following is useful (eth6 uses the ixgbe driver):

root@helena:~# ethtool -k eth6
Features for eth6:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: on
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp6-segmentation: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: on
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
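
One way to use a feature list like this when narrowing down a problem is to see which offloads are currently on and not marked `[fixed]`, since those are the ones that could be switched off one at a time with `ethtool -K` for testing. The following is a minimal Python sketch under that assumption. It is not part of the original message, the function names are hypothetical, and the sample is a short excerpt of the listing above.

```python
def parse_features(text):
    """Parse `ethtool -k` output into {feature: (is_on, is_fixed)}."""
    features = {}
    for line in text.splitlines():
        name, sep, rest = line.strip().partition(":")
        tokens = rest.split()
        # Skip the "Features for ethX:" header and malformed lines.
        if not sep or not tokens or tokens[0] not in ("on", "off"):
            continue
        features[name.strip()] = (tokens[0] == "on", "[fixed]" in tokens)
    return features


def toggleable_on(features):
    """Return features that are enabled and not marked [fixed], sorted by name."""
    return sorted(n for n, (on, fixed) in features.items() if on and not fixed)


# Short excerpt of the eth6 feature listing quoted above.
sample = """\
Features for eth6:
generic-receive-offload: on
large-receive-offload: on
ntuple-filters: off
highdma: on [fixed]
rx-all: off [fixed]
"""

features = parse_features(sample)
for name in toggleable_on(features):
    print(name)
```

Whether any of these offloads is related to the slowdown is not known from the information in this thread. The sketch only lists candidates.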


It would really be nice if one could set the RSS=8,8 option for the kernel.org
ixgbe driver too, or if someone could tell me where I can force the driver
to set Receive Side Scaling to 8, even if it means editing the source code.

Below I have added some additional information. Please CC me since I
am not subscribed to any of these lists. And please do not hesitate
to ask if more information is needed.

I would suggest that you open up a bug at e1000.sf.net - describe your configuration and attach the relevant info (dmesg, ethtool -S, lspci etc). This would make it easier for us to follow.

Sorry, but I could not find out how to open a new bug. I could only view
existing bugs. Please give me a hint as to what I need to do.

Thanks,
Holger