ksoftirqd causing severe network performance problems

From: Brendan Keessen
Date: Fri Sep 05 2003 - 05:59:17 EST


Hi,

More than a week ago we replaced our old Linux core routers (in a
failover setup) with a new setup. The old routers used two 100 Mbit NICs
and worked very well, but we needed more than 100 Mbit of throughput, so
we replaced them with an almost identical setup based on two new servers
with two 1 Gbit NICs. At peak time it processes about 70 Mbit/sec of
traffic, and we use VLANs and iptables for firewalling and DNAT of
almost all connections, the same as in the old setup.

At the end of last week the new setup had network problems. What we saw on
the Linux router was that the kernel threads ksoftirqd_CPU0 and
ksoftirqd_CPU1 were using almost 100% of system time and the network
throughput collapsed. This happens once or twice every day. The first
occurrence seems reasonably predictable: it happens while the network
traffic rises from a constant 3 Mbit/sec to 46 Mbit/sec over 3 hours. At
roughly 40 Mbit/sec the problem occurs, and a failover to the slave router
solves it. On the faulty server (previously the master) the CPU usage then
drops from 100% system time to almost 100% idle. While the backup is
working fine, we can't use the faulty server again for routing/firewalling,
because failing back to it instantly results in 100% system time again.
Rebooting the system helps.

Because the router was a new server (Dell 2650, dual Xeon) with a new
network card (Gigabit Broadcom 5703, which we had never used in our servers
before), we thought that the driver for the card might be causing the
problem. After switching drivers and switching between kernel versions
(2.4.21, 2.4.22 and 2.4.18, which ran perfectly on our old router), we
eventually chose to replace the server with a Dell 1650, which has two
gigabit e1000 interfaces. Different kernels and e1000 drivers resulted in
the same problem again. We are now running 2.4.18 with the 4.3.2 e1000
driver. I know this is not the newest kernel or the newest driver, but that
doesn't seem to be the cause, because we tested with other network cards,
drivers and kernel versions.

The same problem still exists on the new server with totally different
network cards. In the kernel logfiles we don't see any messages at all
which are related to the problem.

Here is some information about the state of the server while the network
performance has collapsed and ksoftirqd_CPU0/ksoftirqd_CPU1 are using 99%
system time:

routing cache (no. entries):

$ ip r ls cache | grep from | wc -l
69323

$ cat /proc/sys/net/ipv4/route/max_size
131072

$ cat /proc/sys/net/ipv4/route/gc_thresh
8192

We thought that for some reason the routing cache might be thrashing, so we
experimented with raising max_size to 4 times the current value, raising
gc_thresh to 80% of that value, and setting gc_elasticity to 32. That
didn't help; the same problem occurred again.
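
For reference, the arithmetic behind those settings can be sketched as
follows. This is only an illustration: the function and constant names are
made up for the example, and rounding the 80% threshold down to an integer
is an assumption, since the exact values written to /proc are not listed
here.

```python
# Values reported above from /proc/sys/net/ipv4/route/ and `ip r ls cache`.
CURRENT_MAX_SIZE = 131072
CURRENT_GC_THRESH = 8192
CACHE_ENTRIES = 69323


def tuned_route_cache(max_size, factor=4, thresh_fraction=0.80):
    """Return (new_max_size, new_gc_thresh) for the experiment described.

    Assumption: the threshold is rounded down to a whole number of entries.
    """
    new_max_size = max_size * factor
    new_gc_thresh = int(new_max_size * thresh_fraction)
    return new_max_size, new_gc_thresh


new_max_size, new_gc_thresh = tuned_route_cache(CURRENT_MAX_SIZE)
print("max_size:", new_max_size, "gc_thresh:", new_gc_thresh)

# How full the cache was, relative to the original limits.
print("entries / max_size: ", round(CACHE_ENTRIES / CURRENT_MAX_SIZE, 3))
print("entries / gc_thresh:", round(CACHE_ENTRIES / CURRENT_GC_THRESH, 2))
```

The last two lines only put the measured 69323 entries in relation to the
original max_size and gc_thresh; they say nothing about why the cache grew
that large.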

info on ip conntrack (no. entries):

$ cat /proc/net/ip_conntrack | wc -l
126804

The ip_conntrack module is loaded with the hashsize parameter:

ip_conntrack hashsize=2097152
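
As a rough sanity check on those numbers, the average conntrack hash chain
length can be estimated as below. This is only a sketch: it assumes one
line in /proc/net/ip_conntrack per tracked connection and an even spread of
entries over the buckets, and the function name is made up for the example.

```python
def average_chain_length(entries, buckets):
    """Average number of conntrack entries per hash bucket."""
    return entries / buckets


# 126804 entries counted above, hashsize=2097152 buckets.
load = average_chain_length(126804, 2097152)
print(f"{load:.3f} entries per bucket on average")
```

If the even-spread assumption holds, a value this far below 1 means most
buckets are empty, so long hash chains would not be the first suspect.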

To give you more input I turned on kernel profiling, reading and clearing
the profiling info every 60 seconds. The kernel functions that use the most
clock ticks (top 10) during the problem are:

31 handle_IRQ_event 0.2500
32 add_timer 0.1311
34 net_rx_action 0.0467
35 __kfree_skb 0.1136
35 batch_entropy_store 0.1944
40 dev_queue_xmit 0.0535
50 ip_route_input 0.1238
676 __write_lock_failed 21.1250
2928 __read_lock_failed 146.4000
3620 default_idle 69.6154

A few minutes before the problem occurred (normal state):

37 __kfree_skb 0.1201
49 net_rx_action 0.0673
50 dev_queue_xmit 0.0668
50 handle_IRQ_event 0.4032
54 ip_route_input 0.1337
56 schedule 0.0422
68 __write_lock_failed 2.1250
73 batch_entropy_store 0.4056
742 __read_lock_failed 37.1000
8893 default_idle 171.0192
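
One way to compare the two listings is to compute, per function, how much
the tick count changed between the normal state and the problem state. The
sketch below assumes each line has the three columns "ticks, function name,
normalized load" (the layout `readprofile` prints); that interpretation and
the helper names are assumptions. Only functions present in both top-10
lists can be compared.

```python
DURING = """\
31 handle_IRQ_event 0.2500
32 add_timer 0.1311
34 net_rx_action 0.0467
35 __kfree_skb 0.1136
35 batch_entropy_store 0.1944
40 dev_queue_xmit 0.0535
50 ip_route_input 0.1238
676 __write_lock_failed 21.1250
2928 __read_lock_failed 146.4000
3620 default_idle 69.6154
"""

BEFORE = """\
37 __kfree_skb 0.1201
49 net_rx_action 0.0673
50 dev_queue_xmit 0.0668
50 handle_IRQ_event 0.4032
54 ip_route_input 0.1337
56 schedule 0.0422
68 __write_lock_failed 2.1250
73 batch_entropy_store 0.4056
742 __read_lock_failed 37.1000
8893 default_idle 171.0192
"""


def parse_profile(text):
    """Map function name -> tick count from 'ticks name load' lines."""
    ticks = {}
    for line in text.strip().splitlines():
        count, name, _load = line.split()
        ticks[name] = int(count)
    return ticks


def tick_ratios(before, during):
    """Ratio during/before for every function listed in both profiles."""
    return {name: during[name] / before[name]
            for name in during if name in before}


ratios = tick_ratios(parse_profile(BEFORE), parse_profile(DURING))
# Print the functions whose tick count grew the most first.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} x{ratio:.2f}")
```

With these figures the two lock-contention functions stand out:
__write_lock_failed grows by roughly a factor of 10 and __read_lock_failed
by roughly a factor of 4, while most other entries stay flat or shrink.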

I also monitored the interrupts (/proc/interrupts) of eth0 and eth1, but
the interrupt rate seems to follow the throughput at that moment and no
strange bursts of interrupts occur.

Both before and during the problem the interrupt rates are about:

eth0: 5000/s
eth1: 4200/s
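
Rates like these can be derived from two /proc/interrupts snapshots taken a
known number of seconds apart. The following is a minimal sketch, not the
script actually used: the IRQ numbers, counter values and 10-second
interval in the sample data are invented (chosen so the result matches the
rates above), and it assumes each NIC has its own, unshared IRQ line.

```python
def irq_counts(snapshot, devices):
    """Sum the per-CPU counters for each named device in /proc/interrupts text."""
    counts = {}
    for line in snapshot.splitlines():
        fields = line.split()
        # Data lines start with "<irq>:"; the CPU header line does not.
        if len(fields) < 2 or not fields[0].endswith(":"):
            continue
        name = fields[-1]
        if name not in devices:
            continue
        total = 0
        for field in fields[1:]:
            if not field.isdigit():
                break  # reached the interrupt controller / device columns
            total += int(field)
        counts[name] = total
    return counts


def irq_rates(before, after, devices, seconds):
    """Interrupts per second for each device between two snapshots."""
    first = irq_counts(before, devices)
    second = irq_counts(after, devices)
    return {dev: (second[dev] - first[dev]) / seconds for dev in devices}


# Hypothetical snapshots, 10 seconds apart.
SNAP_1 = """\
           CPU0       CPU1
 24:     100000      50000   IO-APIC-level  eth0
 25:      80000      40000   IO-APIC-level  eth1
"""
SNAP_2 = """\
           CPU0       CPU1
 24:     130000      70000   IO-APIC-level  eth0
 25:     101000      61000   IO-APIC-level  eth1
"""

print(irq_rates(SNAP_1, SNAP_2, ("eth0", "eth1"), 10.0))
```

On a live system the two snapshots would be read from /proc/interrupts with
a sleep in between instead of being hard-coded.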

Does anybody know why we have this problem and how to solve it? Or could
you tell me what additional info is needed, and how I can get it, to
resolve the problem?

Thanks,
Brendan Keessen