5 day period in 2.0.* server

Peter T. Breuer (ptb@it.uc3m.es)
Sun, 7 Jun 1998 13:50:59 +0200 (MET DST)


One of my linux P100 servers (running 2.0.33 presently, but it makes no
difference if it's 33 or 30 or 28 or 25, which kind of rules out the
kernel as a cause, ahem :-) is dying with an approximate period of 5
days. It has been doing so for several months and I have been unable to
track down why. I believe it was not doing this several months ago, but
the symptoms may have been masked. I would welcome any ideas.

Symptoms

1) performance gets more and more sluggish, though I see no user or
system time being used in top.
2) load goes higher and higher, as you might expect if processes take
longer and longer to run
3) init seems to get itself as its own parent pid in the process table.
That makes the output of pstree go wild. Fortunately it's got a
printout limit built in! I have to kill it about 8 hours after
it becomes noticable. Sometimes I can't. Sometimes the software
watchdog can kill it when I send the daemon a stop signal. Sometimes
not.
4) degradation is exponential when symptoms appear, as one might
expect. The "half life" seems to be an hour.
5a) I am running amd but stopping amd and unmounting all nfs mounts
does not help markedly, though it helps a little.
5b) Stopping ypbind and restarting only gives me dead ypbinds (presumably
it times out somewhere) even though it should find ypserv on localhost
(these are the latest ypbind-mt and ypserv 3.*, btw).
5c) That leaves dns as a suspect, but everything needed is listed in
/etc/hosts. I suspect a bad interaction with a solaris 5.6 dns and
NIS(+) server on the same net. This is all bind 4.* stuff.
6) this particular server is not configured any differently from any
other, but it is more loaded. It distributes user directories via
nfs to at least ten-twenty machines over a 10BT net with a 3c509b
card. When the symptoms show, I get Tx timed out messages from the NIC
in logs. The server also loads nfs directories from ten other machines
for administrative purposes.
7) when the slowdown happens, I only see normal memory uses. I am about
10M into swap on a 64MB machine with very little cache or buffer
showing.
9) dmesg shows nothing unusual apart from the Tx distress messages and
nfs and rpc timeouts. Syslogd is comatose, but stopping and
restarting it doesn't result in any significant kernel activity.
10) sometimes httpd goes wild and spawns 256 processes, even though it's
process limited by its config files. I believe this to be a symptom
and not a cause.
11) netstat showed me nothing very suspicious. In particular no sockets
obviously stuck.

Here are some logs. This is all that showed in syslog yesterday between
the time that syslog was restrated automatically at 8am and the time I
forced a shutdown at 13.50. It took a litlle time to die ...

Jun 6 07:51:34 arpa syslogd: exiting on signal 15
Jun 6 08:00:15 arpa ypbind[15087]: localhost: RPC: Timed out
Jun 6 13:49:34 arpa inetd[153]: /usr/sbin/tcpd: exit signal 0xf
Jun 6 13:49:57 arpa inetd[153]: /usr/sbin/tcpd: exit signal 0xf
Jun 6 14:00:20 arpa ypbind[24445]: localhost: RPC: Timed out
Jun 6 14:04:19 arpa ypbind[24445]: tamtam.it.uc3m.es: RPC: Unable to receive; errno = Connection refused
Jun 6 14:05:36 arpa sshd[25081]: fatal: Local: Command terminated on signal 2.

I can't see anything unusual in the messages log, except perhaps for
the sheer quantity of traffic. There was some kind of activity being
logged every few seconds (as usual). Notably bootpd flinging insults at
persistent and unwanted requestors (I don't own the whole of this
segment), sshd logging my connects, icmplog noting unreachables from
various outside places, such as the national computing nexus, tcplog
telling me I've connected, lpd complaining about losing a connect,
named mumbling, etc. etc.

I almost suspect it's the logging that's dragging the machine down.
Could that be it? If everything times out and complains, it could
perhaps tie up the kernel just buffering i/o??? Then that would lead to
more timeouts and complaints? That might be exponential.

Hmm .. I had set the debug log to contain the loads snapped at 5min
intervals. That may be interesting. It shows a minor 1hr cycle imposed
on the gradually increasing load trend. Every hour the server pulls
NIS maps across by force from the solaris server. Amongst other things
it does a ypset for periods of some seconds at those times. It can't
use ypxfr against a NIS+ server. Perhaps I should enforce a timout in
the ypcat that it tries ..

Jun 6 07:55:08 average: 2.77, 1.65, 0.93
Jun 6 08:00:11 average: 3.00, 2.27, 1.38
Jun 6 08:05:08 average: 4.35, 4.31, 2.58
Jun 6 08:10:08 average: 2.36, 2.95, 2.47
Jun 6 08:15:10 average: 2.94, 2.66, 2.46
Jun 6 08:20:06 average: 2.47, 2.51, 2.44
Jun 6 08:25:06 average: 2.39, 2.29, 2.35
Jun 6 08:30:08 average: 2.53, 2.26, 2.30
Jun 6 08:35:07 average: 2.48, 2.29, 2.29
Jun 6 08:40:06 average: 2.41, 2.29, 2.28
Jun 6 08:45:08 average: 2.39, 2.19, 2.22
Jun 6 08:50:06 average: 2.33, 2.20, 2.20
Jun 6 08:55:08 average: 2.32, 2.14, 2.15
Jun 6 09:00:12 average: 2.96, 2.33, 2.21
Jun 6 09:05:08 average: 7.57, 6.02, 3.89
Jun 6 09:10:07 average: 2.97, 4.22, 3.73
Jun 6 09:15:07 average: 2.09, 2.87, 3.26
Jun 6 09:20:07 average: 2.56, 2.50, 2.98
Jun 6 09:25:09 average: 2.47, 2.43, 2.82
Jun 6 09:30:09 average: 2.59, 2.37, 2.66
Jun 6 09:35:09 average: 2.41, 2.30, 2.54
Jun 6 09:40:07 average: 2.22, 2.17, 2.41
Jun 6 09:45:08 average: 2.25, 2.14, 2.32
Jun 6 09:50:08 average: 2.84, 2.42, 2.38
Jun 6 09:55:06 average: 2.33, 2.25, 2.30
Jun 6 10:00:12 average: 2.56, 2.32, 2.30
Jun 6 10:05:08 average: 5.62, 5.03, 3.56
Jun 6 10:10:07 average: 2.72, 3.88, 3.52
Jun 6 10:15:08 average: 2.35, 2.86, 3.16
Jun 6 10:20:11 average: 2.28, 2.46, 2.90
Jun 6 10:25:07 average: 2.46, 2.32, 2.70
Jun 6 10:30:09 average: 2.51, 2.31, 2.57
Jun 6 10:35:08 average: 3.10, 2.53, 2.57
Jun 6 10:40:07 average: 2.41, 2.41, 2.50
Jun 6 10:45:07 average: 2.44, 2.39, 2.46
Jun 6 10:50:08 average: 2.63, 2.43, 2.45
Jun 6 10:55:09 average: 2.30, 2.38, 2.42
Jun 6 11:00:09 average: 2.79, 2.42, 2.41
Jun 6 11:05:11 average: 7.13, 5.69, 3.85
Jun 6 11:10:10 average: 4.58, 5.41, 4.27
Jun 6 11:15:09 average: 2.51, 3.62, 3.83
Jun 6 11:20:08 average: 2.75, 2.79, 3.38
Jun 6 11:25:11 average: 2.25, 2.37, 3.02
Jun 6 11:30:07 average: 2.55, 2.34, 2.81
Jun 6 11:35:07 average: 2.57, 2.33, 2.66
Jun 6 11:40:10 average: 2.47, 2.27, 2.53
Jun 6 11:45:08 average: 2.60, 2.31, 2.46
Jun 6 11:50:07 average: 2.29, 2.22, 2.37
Jun 6 11:55:09 average: 2.98, 2.40, 2.38
Jun 6 12:00:09 average: 3.64, 2.74, 2.50
Jun 6 12:05:11 average: 8.31, 6.23, 4.08
Jun 6 12:10:10 average: 6.31, 6.40, 4.75
Jun 6 12:15:08 average: 2.93, 4.27, 4.30
Jun 6 12:20:07 average: 2.37, 3.02, 3.74
Jun 6 12:25:09 average: 2.52, 2.63, 3.36
Jun 6 12:30:09 average: 2.48, 2.50, 3.10
Jun 6 12:35:09 average: 3.12, 2.70, 3.00
Jun 6 12:40:09 average: 3.22, 2.96, 3.02
Jun 6 12:45:10 average: 3.20, 2.81, 2.91
Jun 6 12:50:12 average: 2.91, 2.74, 2.84
Jun 6 12:55:09 average: 2.95, 2.77, 2.82
Jun 6 13:00:12 average: 3.34, 2.84, 2.81
Jun 6 13:05:16 average: 9.02, 6.55, 4.42
Jun 6 13:10:17 average: 8.69, 7.61, 5.45
Jun 6 13:15:15 average: 4.78, 6.26, 5.50
Jun 6 13:20:12 average: 3.64, 4.43, 4.90
Jun 6 13:25:13 average: 3.02, 3.47, 4.34
Jun 6 13:30:13 average: 5.26, 3.93, 4.23
Jun 6 13:35:14 average: 4.82, 3.80, 4.04
Jun 6 13:40:11 average: 4.00, 3.82, 3.97
Jun 6 13:45:05 average: 4.21, 3.73, 3.86
Jun 6 13:50:06 average: 3.85, 3.68, 3.78
Jun 6 13:55:03 average: 2.53, 3.27, 3.60
Jun 6 14:00:09 average: 2.23, 2.58, 3.19
Jun 6 14:05:15 average: 7.53, 6.00, 4.54
Jun 6 14:10:09 average: 5.59, 5.97, 4.96
Jun 6 14:15:03 average: 1.15, 3.20, 4.07 <- reboot?
Jun 6 14:20:09 average: 1.28, 1.58, 3.09
Jun 6 14:25:04 average: 0.02, 0.65, 2.27 <- normality!
Jun 6 14:30:05 average: 0.01, 0.26, 1.65
Jun 6 14:35:04 average: 0.29, 0.30, 1.28
Jun 6 14:40:04 average: 0.07, 0.13, 0.93
Jun 6 14:45:04 average: 0.01, 0.06, 0.67
Jun 6 14:50:04 average: 0.50, 0.25, 0.57
Jun 6 14:55:05 average: 0.05, 0.12, 0.42
Jun 6 15:00:07 average: 0.12, 0.09, 0.31
Jun 6 15:05:17 average: 6.77, 3.86, 1.87
Jun 6 15:10:04 average: 0.97, 2.47, 1.87
Jun 6 15:15:05 average: 0.87, 1.38, 1.54
Jun 6 15:20:07 average: 0.87, 0.99, 1.33
Jun 6 15:25:05 average: 0.06, 0.53, 1.04
Jun 6 15:30:04 average: 0.41, 0.30, 0.78
Jun 6 15:35:05 average: 0.18, 0.23, 0.61
Jun 6 15:40:05 average: 2.25, 1.46, 0.98
Jun 6 15:45:04 average: 0.56, 0.80, 0.81
Jun 6 15:50:05 average: 0.42, 0.54, 0.68
Jun 6 15:55:05 average: 0.15, 0.27, 0.52
Jun 6 16:00:09 average: 0.55, 0.35, 0.47
Jun 6 16:35:01 average: 2.46, 1.40, 0.59
Jun 6 16:40:02 average: 0.02, 0.57, 0.46
Jun 6 16:45:02 average: 0.12, 0.36, 0.39
Jun 6 16:50:02 average: 0.17, 0.25, 0.33

Peter ptb@it.uc3m.es

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu