2.6.21.x kernel panic (tg3 and nfs related)

From: Andre Noll
Date: Wed Jun 27 2007 - 06:54:42 EST


Hi

Our NFS server recently panicked under heavy NFS load. The backtrace
indicates that this might be a problem with the tigon3 network driver,
which drives the onboard chips of the machine.

The first crash under 2.6.21.1 happened after about 4 days of uptime;
2.6.21.5 crashed after only 15 minutes.

Screenshots of the resulting kernel panics are available at

http://www.systemlinux.org/~maan/shots/huangho-crash-2.6.21.1.png
and
http://www.systemlinux.org/~maan/shots/huangho-crash-2.6.21.5.png

We're now running 2.6.18.6 again, which happens to be rock solid for our
workload. However, this kernel now spits out zillions of messages like

[55122.674290] RPC: bad TCP reclen 0x00010094 (large)

I'm sure it didn't do that half a year ago, when it ran for several
months without problems. The 2.6.21.x kernels did not print these
messages either, but from what I understand this is due to a patch
which went in somewhere between 2.6.18 and 2.6.21 and which just
ratelimited the message.
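
For reference, here is a minimal sketch of how the value in that message
can be decoded. RPC over TCP prefixes each record fragment with a 32-bit
marker: the top bit flags the last fragment and the low 31 bits give the
fragment length (the ONC RPC record marking standard). The sketch assumes
the logged value is that length field; the helper name
decode_record_marker is made up for illustration and is not a kernel
function.

```python
# Sketch only: decode a 32-bit ONC RPC record marker.
# Assumption: the value printed in the log line is (or contains) the
# 31-bit fragment length from the record marker.

LAST_FRAGMENT_FLAG = 0x80000000  # top bit: set on the last fragment
LENGTH_MASK = 0x7FFFFFFF         # low 31 bits: fragment length in bytes


def decode_record_marker(marker):
    """Return (is_last_fragment, length_in_bytes) for a record marker."""
    if not 0 <= marker <= 0xFFFFFFFF:
        raise ValueError("record marker must fit in 32 bits")
    return bool(marker & LAST_FRAGMENT_FLAG), marker & LENGTH_MASK


# Value taken from the log line above.
is_last, length = decode_record_marker(0x00010094)
print(length, length - 64 * 1024)  # 65684 148
```

Read this way, 0x00010094 is a record of 65684 bytes, i.e. 64 KiB plus
148 bytes. That would fit a client sending 64 KiB writes to a server
whose limit is smaller, but this is only a guess from the number itself.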

So something weird seems to be going on in our network, and this might
well be related to the 2.6.21.x crashes we are seeing.

Thanks
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe
