Help diagnosing NFS lockups

From: Lars Kellogg-Stedman
Date: Tue May 25 2004 - 14:27:26 EST


Hi there,

I'm hoping someone out there more kernel-knowledgeable than I am can lend
a hand in tracking down some NFS-related problems we're having with our
Linux mailserver.

We're running RH9, with kernel 2.4.20-31.9. The mail spool is mounted
from a Network Appliance filer (rw, fg, tcp, vers=3, timeo=600,
rsize=8192, wsize=8192, hard, intr). There's also a lot of NFS
automounter activity on the box for user home directories
(rw,nosuid,intr).
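
To confirm which options the kernel actually applied to each NFS mount (as
opposed to what was requested at mount time or in the automounter maps), the
mount table can be parsed. What follows is only a rough sketch: the function
name parse_nfs_mounts is made up for illustration, it assumes the usual
"device mountpoint fstype options dump pass" layout of /proc/mounts, and the
exact option names reported there vary between kernel versions.

```python
def parse_nfs_mounts(text):
    """Return {mountpoint: {option: value or True}} for NFS entries.

    `text` is assumed to be in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    mounts = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip malformed lines and anything that is not an NFS mount.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        options = {}
        for item in fields[3].split(","):
            key, sep, value = item.partition("=")
            # Flags such as "hard" carry no value; record them as True.
            options[key] = value if sep else True
        mounts[fields[1]] = options
    return mounts
```

Feeding it the contents of /proc/mounts on the mail server would show, per
mount point, whether options such as hard, intr, rsize and wsize ended up as
intended.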

Periodically, the load on this system spikes way up -- but not because any
processes are using CPU time. Possibly there are "many" processes waiting
for IO, although "many" only appears to be around 53. The culprit seems to
be that access to the NFS filesystem is "slow", but we haven't been able to
quantify this (so the problem may in fact be something else entirely).
Rebooting the system makes the problem go away temporarily.
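
One way to start quantifying this: the Linux load average counts tasks in
uninterruptible sleep (state D, typically blocked on I/O) as well as runnable
ones, so a high load with idle CPUs usually means many D-state processes.
The sketch below tallies process states from /proc/<pid>/stat. It is an
illustration rather than a finished tool; the function names are invented
here, and the proc_root parameter exists only so the code can be pointed at
a test directory.

```python
import os


def state_from_stat(stat_text):
    """Return the state letter from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split on the last ')' before reading the state.
    """
    after_comm = stat_text.rsplit(")", 1)[1]
    return after_comm.split()[0]


def count_states(proc_root="/proc"):
    """Return {state_letter: count} over all numeric entries in proc_root."""
    counts = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                state = state_from_stat(f.read())
        except (OSError, IndexError):
            continue  # process exited, or the file was unreadable or malformed
        counts[state] = counts.get(state, 0) + 1
    return counts
```

Running count_states() once a minute and logging the "D" count next to the
load average would show whether the spikes line up with processes stuck
waiting on I/O.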

Yes, we know the mail spool probably shouldn't sit on an NFS filesystem.
We're working on changing that, but this is what we inherited, and we only
started encountering these problems after migrating from an underpowered
Solaris box to an IBM BladeCenter running Linux (we're using the bcm5700
drivers, rather than the tg3 drivers from the stock kernel, due in
part to reports of NFS lockups with the tg3 drivers on the blades).

Any suggestions people can send our way would be greatly appreciated. If
you're in the Boston area and would like to discuss resolving this on
contract, we may be able to work something out.

I'd appreciate it if you would Cc: me on any replies.

-- Lars

--
Lars Kellogg-Stedman <lars@xxxxxxxxxxxxxxxx>
IT Operations Manager
Division of Engineering and Applied Sciences
Harvard University

