Re: server about to crash

Peter T. Breuer (ptb@it.uc3m.es)
Wed, 8 Jul 1998 18:50:16 +0200 (MET DST)


"A month of sundays ago kwrohrer@ce.mediaone.net wrote:"
>
> And lo, Peter T. Breuer saith unto me:
> >
> > > > Networking buffers in use : 9674
> > >
> > > That's the interesting one, if that gradually and continually climbs
> >
> > First indications are that it does. I have been tarring over nfs to
> > /dev/null. But I was doing that before too.
>
> > Networking buffers in use : 12222
>
> I remember seeing this sort of thing back when fragmentation killed NFS-
> with-8k-pages usability. Have you been monitoring the NFS performance

I have .. in the sense that I watch everything. The nfs layout is as follows:

A 1GB file system on this server is nfs mounted by about 10-30 linux machines
(it varies) using 8K r/wsize. Two solaris and two sgi machines also mount
that file system. Mmm .. another couple of sparcs running RH 4.2 also mount
it.

Another 1GB file system on the server is nfs mounted by the solaris machines
too.

A third 1GB file system on the server (local binaries) is exported to all
linux machines.

In the other direction, this server also mounts the /etc file systems
(about 32MB each) of all other machines in reach. It uses amd to time out
mounts, as do the clients. It also mounts home file systems of about 1GB
each from two other linux machines.

The server also mounts and unmounts a 1GB file system every morning, to
which it sends the incremental backups that arrive via ssh from the backup
servers on all the other machines.

It also does samba and mail and stuff.

The performance over nfs is not marvellous, but it is acceptable. The
local (10BT) network has a 4% collision rate. The nfs mounts do time
out when the network is overloaded. This happens in particular when
NIS dies for more than a few seconds. That can happen when a NIS server
goes down and ypbind fails to rebind. A particularly dangerous time is
every hour, when NIS maps are collected from a solaris NIS+ server, and
ypbind is redirected with ypset for what in theory is a couple of
seconds, but can be much longer.

> at all? Back when this was a problem, it would get so bad the client
> would start timing out...

I am worried about nfs.

This kind of thing

Jul 8 12:10:31 arpa kernel: nfs_rpc_verify: RPC call failed: 5
Jul 8 12:26:32 arpa kernel: nfs_rpc_verify: RPC call failed: 5

(remote rpc.nfsd server down or timed out) can snowball. In particular, this

Jul 8 12:55:50 arpa kernel: eth0: transmit timed out, Tx_status 88 status 2004
Jul 8 13:00:43 arpa kernel: eth0: transmit timed out, Tx_status 88 status 2004
Jul 8 13:18:25 arpa kernel: eth0: transmit timed out, Tx_status 88 status 2004

can happen. That was me tarring stuff over nfs to /dev/null. This

Jul 8 07:52:07 arpa amd[8339]: Error reading RPC reply: Connection refused
Jul 8 07:53:07 arpa last message repeated 2 times

is more usual. That was what was happening when the server finally went down
this morning ...

Jul 8 10:34:10 arpa last message repeated 2 times
Jul 8 10:34:41 arpa amd[8339]: Error reading RPC reply: Connection refused
Jul 8 10:42:24 arpa kernel: Cannot find map file.
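
For anyone wanting to see which of these messages dominates in the run-up
to a crash, here is a minimal sketch that tallies syslog lines by source and
message text. It is not what I actually run; the function name tally() and
the regular expressions are made up for illustration, and it assumes the
plain "Mon D HH:MM:SS host tag: message" layout shown above. It folds
"last message repeated N times" lines into the count of the preceding
message, and ignores such a line if nothing precedes it in the excerpt.

```python
import re
from collections import Counter

# Assumed layout: "Jul 8 12:10:31 arpa kernel: message text"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+(?P<rest>.*)$"
)
# syslogd's compression of identical consecutive messages
REPEAT_RE = re.compile(r"^last message repeated (\d+) times?$")
# "kernel: ..." or "amd[8339]: ..." (the pid is optional and discarded)
TAG_RE = re.compile(r"^(?P<tag>[\w./-]+)(?:\[\d+\])?:\s+(?P<msg>.*)$")


def tally(lines):
    """Count messages per (tag, message), expanding 'repeated N times'."""
    counts = Counter()
    previous = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a syslog line in the assumed layout
        rest = m.group("rest")
        rep = REPEAT_RE.match(rest)
        if rep:
            # Credit the repeats to the last message seen, if there is one.
            if previous is not None:
                counts[previous] += int(rep.group(1))
            continue
        t = TAG_RE.match(rest)
        if not t:
            continue
        previous = (t.group("tag"), t.group("msg"))
        counts[previous] += 1
    return counts


# A few of the lines quoted in this message, used as sample input.
SAMPLE = """\
Jul 8 12:10:31 arpa kernel: nfs_rpc_verify: RPC call failed: 5
Jul 8 12:26:32 arpa kernel: nfs_rpc_verify: RPC call failed: 5
Jul 8 12:55:50 arpa kernel: eth0: transmit timed out, Tx_status 88 status 2004
Jul 8 07:52:07 arpa amd[8339]: Error reading RPC reply: Connection refused
Jul 8 07:53:07 arpa last message repeated 2 times
"""

if __name__ == "__main__":
    for (tag, msg), n in tally(SAMPLE.splitlines()).most_common():
        print(f"{n:4d}  {tag}: {msg}")
```

Fed the real log instead of SAMPLE, the same tally would show whether the
amd refusals or the nfs_rpc_verify failures pile up first.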

> Keith
>
>
>
> --
> "The avalanche has already started; |Linux: http://www.linuxhq.com |"Zooty,
> it is too late for the pebbles to |KDE: http://www.kde.org | zoot
> vote." Kosh, "Believers", Babylon 5 |Keith: kwrohrer@enteract.com | zoot!"
> www.midwinter.com/lurk/lurker.html |http://www.enteract.com/~kwrohrer | --Rebo
>

Peter

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu