problem with pvm and headless cluster

KOEHLEKR@UCRWCU.RWC.UC.EDU
Wed, 17 Mar 1999 23:50:39 -0500 (EST)


To: Linux kernel listserve
Beowulf listserve
comp.parallel.pvm

Hi All,

I am having a strange problem with a Beowulf cluster. I am going to describe
it in ungodly detail, since I have not been inside the kernel for all that
long and (hopefully) I am missing something that some of you might see.

I am using pvm 3.3.11-10 with kernel 2.0.36. The cluster consists of a
master which exports (NFS) all necessary filesystems to the headless (and
diskless) slaves. When attempting to add a slave to the master's pvm
configuration, there is a long (minutes) delay followed by a successful
completion message, but while the pvmd is running on the slave, the master
does not show the slave in the config, and the slave pvmd bails soon after.

What puzzles me is that the same software, installed on a cluster of
normal PCs (so no NFS), works with no problems whatsoever.

The master's log file reads:

[t80040000] netoutput() timed out sending to slave2 after 14, 190.000000
[t80040000] hd_dump() ref 1 t80000 n "slave2" a "" ar "LINUX"
[t80040000] lo "" so "" dx "" ep "" bx "" wd "" sp 1000
[t80040000] sa 192.168.1.2:1036 mtu 4096 f 0x0 e 0 txq 1
[t80040000] tx 2 rx 1 rtt 1.000000
[t80040000] dm_halt() from (ken), halting...
[t80040000] work() pvmd halting
[t80040000] pvmbailout(0)

The slave's log file reads:

[t80080000] slave2 (192.168.1.2:1036) LINUX 3.3.11
[t80080000] ready Thu Mar 11 19:50:53 1999
[t80080000] netinput() recvfrom(netsock): Connection refused
(repeated 13 more times)
[t80080000] work() run = STARTUP, timed out waiting for master
[t80080000] pvmbailout(0)
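
The two logs can be compared mechanically. The sketch below is only an illustration: `expand_log` and `count_messages` are hypothetical helpers, and reading the master's "after 14" as a count of send attempts is an assumption, not something the log states. Under that assumption, the number of refusals on the slave equals the number of attempts on the master.

```python
import re

# Matches pvmd log lines of the form "[t80080000] message text".
LOG_LINE = re.compile(r"^\[t([0-9a-fA-F]+)\]\s+(.*)$")
# Matches the hand-written summary line "(repeated 13 more times)".
REPEAT = re.compile(r"^\(repeated (\d+) more times\)$")


def expand_log(text):
    """Return a list of (tid, message) pairs, expanding 'repeated' notes."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        repeat = REPEAT.match(line)
        if repeat and entries:
            # Duplicate the previous entry as many times as the note says.
            entries.extend([entries[-1]] * int(repeat.group(1)))
            continue
        match = LOG_LINE.match(line)
        if match:
            entries.append((int(match.group(1), 16), match.group(2)))
    return entries


def count_messages(entries, needle):
    """Count entries whose message contains the given substring."""
    return sum(1 for _tid, message in entries if needle in message)


SLAVE_LOG = """\
[t80080000] slave2 (192.168.1.2:1036) LINUX 3.3.11
[t80080000] ready Thu Mar 11 19:50:53 1999
[t80080000] netinput() recvfrom(netsock): Connection refused
(repeated 13 more times)
[t80080000] work() run = STARTUP, timed out waiting for master
[t80080000] pvmbailout(0)
"""

MASTER_LINE = "[t80040000] netoutput() timed out sending to slave2 after 14, 190.000000"

slave_entries = expand_log(SLAVE_LOG)
refusals = count_messages(slave_entries, "Connection refused")
# ASSUMPTION: the first number after "after" is the attempt count.
attempts = int(re.search(r"after (\d+),", MASTER_LINE).group(1))
print(refusals, attempts)
```

If the counts really do correspond one to one, each datagram from the master may have produced one refusal on the slave, which would tie the two symptoms together.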

The calling sequence which results in the refusal appears to be:

pvmd.c : main -> work -> netinput ->
socket.c: sys_socketcall -> sys_recvfrom ->
udp.c: udp_recvmsg ->
datagram.c: skb_recv_datagram.

I do not see where the ECONNREFUSED can (or should) be returned to the
pvmd. The man page for recvfrom does not list it as a possible error
code (although it is a possibility on a connect, that should not apply
to a UDP conversation).
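
To narrow down where the error comes from, it may help to record the symbolic errno next to each failed receive rather than only the message text. The following is a minimal sketch in Python, not pvmd's C code; `recv_or_report` and `RefusingSocket` are hypothetical names used only for illustration. One possibility to check, offered as an assumption rather than a diagnosis, is that the error is an asynchronous one queued on the socket, for example after an ICMP port-unreachable reply to a datagram sent earlier from that same socket.

```python
import errno


def recv_or_report(sock, bufsize=4096):
    """Call recvfrom once; return (data, addr, None) or (None, None, errno name).

    `sock` is any object with a recvfrom(bufsize) method, so a real UDP
    socket or a test double both work.
    """
    try:
        data, addr = sock.recvfrom(bufsize)
        return data, addr, None
    except OSError as exc:
        # errorcode maps the number back to its symbolic name, e.g.
        # errno.ECONNREFUSED -> "ECONNREFUSED".
        name = errno.errorcode.get(exc.errno, "errno %r" % exc.errno)
        return None, None, name


class RefusingSocket:
    """Hypothetical stand-in that fails the way the slave's netsock did."""

    def recvfrom(self, bufsize):
        raise ConnectionRefusedError(errno.ECONNREFUSED, "Connection refused")


result = recv_or_report(RefusingSocket())
print(result)
```

Returning the errno name instead of raising lets a receive loop keep running and log every failure alongside the datagrams that did arrive.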

The relevant network traffic is as follows (it all takes place within 0.3 seconds).
In order to illustrate what is taking place, I include the names of all
files, directories and links which are referenced in NFS packets. "err"
indicates a lookup which was not found.

master and slave talk about rsh pvmd:

master.1023 -> slave.shell SYN
slave.shell -> master.1023 SYN ACK
master.1023 -> slave.shell ACK
master.1023 -> slave.shell PUSH "1022" ACK
slave.shell -> master.1023 ACK

slave & master converse on NFS ports for tcpd verify, slave log entry "in.rshd connect",
pam auth (?):

etc passwd usr sbin tcpd lib ld-linux.so.2 ld-2.0.7.so ld.so.preload (err)
ld.so.cache libc.so.6 libc-2.0.7.so nsswitch.conf libnss_nisplus.so.1 (err)
lib libnss_nisplus.so.1 (err) libnss_files.so.1 libnss_files-2.0.7.so
protocols hosts.allow hosts.deny localtime ../usr/share/zoneinfo/US/Eastern
.. share zoneinfo US Eastern dev log in.rshd ld-2.0.7.so ld.so.preload (err)
libdl.so.2 libdl-2.0.7.so libpam.so.0 libpam.so.0.64 libpam_misc.so.0
libpam_misc.so.0.64 libc-2.0.7.so libnss_nisplus.so.1 (err) libnss_nisplus.so.1 (err)
libnss_files-2.0.7.so

master and slave set up connection over a new port (WHICH IS NEVER REALLY USED):

slave.1023 -> master.1022 SYN
master.1022 -> slave.1023 SYN ACK
slave.1023 -> master.1022 ACK

more NFS conversation preparing for DNS request, and a username (or dir?):

resolv.conf

master.1023 -> slave.shell PUSH "ken" ACK

more NFS conversation preparing for DNS request:

hosts libnss_nis.so.1 libnss_nis-2.0.7.so libnsl.so.1 libnsl-2.0.7.so libnss_dns.so.1
libnss_dns-2.0.7.so libresolv.so.2 libresolv-2.0.7.so

slave goes to master for DNS (with which I have a slight config error) and to start pvmd:

slave.1040 -> master.domain PTR? master.in-addr.arpa
master.domain -> slave.1040 PTR nameserver.in-addr.arpa
slave.shell -> master.1023 ACK
master.1023 -> slave.shell PUSH "ken pvmd -s -d0 -nslave2 1 7f000001:04a9 4096 2 c0a80102:0000" ACK
slave.1041 -> master.domain A? nameserver.in-addr.arpa
master.domain -> slave.1041 NXDomain (!)
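
The exchange above is a reverse lookup (PTR) whose answer then fails the forward lookup (A). A check along the following lines could confirm whether forward and reverse records agree for each cluster address. It is a sketch only: `reverse_name` and `forward_confirmed` are hypothetical helpers, the master address 192.168.1.1 and the name "nameserver.example" are made up for the demonstration, and the lookups are injectable so the check runs without a live DNS server.

```python
import ipaddress
import socket


def reverse_name(ip):
    """Return the in-addr.arpa name that a PTR query for `ip` asks about."""
    return ipaddress.ip_address(ip).reverse_pointer


def forward_confirmed(ip, reverse_lookup=None, forward_lookup=None):
    """Check that ip -> name (PTR) and name -> ip (A) agree.

    Returns (ok, detail). The lookups default to the system resolver but
    can be replaced with stand-ins.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or socket.gethostbyname
    try:
        name = reverse_lookup(ip)
    except OSError as exc:
        return False, "no PTR record for %s: %s" % (ip, exc)
    try:
        addr = forward_lookup(name)
    except OSError as exc:
        return False, "PTR gave %r but it has no A record: %s" % (name, exc)
    if addr != ip:
        return False, "%r resolves to %s, not %s" % (name, addr, ip)
    return True, name


# Stand-ins reproducing the failure pattern in the trace (hypothetical data).
def fake_ptr(addr):
    return "nameserver.example"


def fake_a(name):
    raise socket.gaierror(socket.EAI_NONAME, "Name or service not known")


print(reverse_name("192.168.1.2"))  # 2.1.168.192.in-addr.arpa
ok, detail = forward_confirmed("192.168.1.1", fake_ptr, fake_a)
print(ok, detail)
```

Called without the last two arguments, `forward_confirmed` uses the system resolver, so it can be run on a slave against the real master address.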

more NFS conversation for slave log entry "couldn't find nameserver" and rhost auth; also an ACK:

../usr/share/zoneinfo/US/Eastern home ken pam.d rsh security pam_rhosts_auth.so
pam_nologin.so pam_pwdb.so libpwdb.so.0

slave.shell -> master.1023 ACK

more NFS conversation for pam auth and slave log entry "pam_rhosts_auth allowed" and
start write of slave log entry "pam_pwdb rsh open":

libpwdb.so.0.55 libcrypt.so.1 libcrypt-2.0.7.so other pam_deny.so hosts.equiv
nologin (err) pwdb.conf shadow (err) shadow (err) group nologin (err)

Not sure what this is:

slave.shell -> master.1023 PUSH "" ACK
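
If this is the standard rsh exchange described in rshd(8), the pieces seen so far fit its framing: the client sends a stderr port number, the client-side user name, the server-side user name and the command, each terminated by a NUL byte. The server connects back to the stderr port and, once authentication succeeds, returns a single NUL byte before any command output. On that reading (an assumption about this trace, not something it proves), the empty PUSH above would be that status byte and the connection to port 1022 would be the stderr channel. The helper names below are hypothetical.

```python
def build_rsh_request(stderr_port, client_user, server_user, command):
    """Frame the client's side of an rsh request as NUL-terminated strings."""
    fields = [str(stderr_port), client_user, server_user, command]
    return b"".join(field.encode("ascii") + b"\0" for field in fields)


def split_rsh_request(payload):
    """Split a request back into its four fields (inverse of the above)."""
    parts = payload.split(b"\0")
    # A well-formed request ends with NUL, so the last piece is empty.
    if len(parts) != 5 or parts[-1] != b"":
        raise ValueError("expected four NUL-terminated fields")
    return [part.decode("ascii") for part in parts[:4]]


def rsh_status_ok(first_byte):
    """rshd signals success with a single NUL byte before command output."""
    return first_byte == b"\0"


# Values copied from the trace: port "1022", user "ken", and the pvmd command.
request = build_rsh_request(
    1022, "ken", "ken",
    "pvmd -s -d0 -nslave2 1 7f000001:04a9 4096 2 c0a80102:0000",
)
print(request)
print(rsh_status_ok(b"\0"))
```

A packet dump that renders NUL as a space or as nothing would show these fields much as they appear in the trace: "1022", then "ken", then "ken pvmd ...".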

more NFS conversation for continuation of last log entry and an ACK:

bin bash ld-2.0.7.so ld.so.preload (err) libtermcap.so.2 libtermcap.so.2.0.8
libc-2.0.7.so libnss_files-2.0.7.so libnss_nisplus.so.1 (twice, err)
libnss_nis-2.0.7.so

master.1023 -> slave.shell ACK

more NFS conversation for shell setup, start of pvmd and writing of pvmd pid:

libnsl-2.0.7.so .. .. etc proc tmp var root lib sbin usr bin bash (err) .bashrc
bashrc pvmd /usr/pvm3/lib/pvmd pvm3 lib pvmd sh bash ld-linux.so.2 ld-2.0.7.so
ld.so.preload (err) ld.so.cache libtermcap.so.2.0.8 libc.so.6 libc-2.0.7.so
nsswitch.conf libnss_files.so.1 libnss_files-2.0.7.so passwd libnss_nisplus.so.1 (err)
lib libnss_nisplus.so.1 (err) libnss_nis-2.0.7.so libnsl-2.0.7.so dev sh (err) bash
/usr/pvm3/lib/pvmd .pvmprofile (twice err) LINUX pvmd3 ld-2.0.7.so ld.so.preload (err)
libc-2.0.7.so null pvml.500 libnss_files-2.0.7.so libnss_nisplus.so.1 (twice err)
libnss_nis-2.0.7.so libnsl-2.0.7.so pvmd.500 (err, then create & write)

Here the slave tells the master that pvmd is up and running, and how to talk to it.
THE UDP PACKET WAS NEVER ANSWERED; the shell and unused connections are closed:

slave.shell -> master.1023 PUSH "ddpro<2315> arch<LINUX> op<c0a80102:0412> mtu<4096>\n" ACK
master.1193 -> slave.1042 UDP 0x8008000080040000000100000700000080020009000000010000000000000000
master.1023 -> slave.shell FIN ACK
slave.shell -> master.1023 ACK
slave.1023 -> master.1022 FIN ACK
master.1022 -> slave.1023 ACK
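
The hexadecimal address:port pairs in the pvmd command line and in the slave's reply can be decoded and compared with the ports in the trace. In the sketch below, `decode_hex_endpoint` and `leading_tids` are hypothetical helpers. Treating the first two 32-bit words of the UDP payload as task ids is an assumption, based only on their matching the [t80080000] and [t80040000] prefixes in the logs. Decoded, 04a9 is 1193 (the master's UDP source port) and 0412 is 1042 (the slave port the packet was sent to), while the address in the master's pair, 7f000001, is 127.0.0.1. If that pair is the address at which the slave is told to reach the master's pvmd (again an assumption), a loopback value there may be worth checking, since replies sent to 127.0.0.1 would never leave the slave.

```python
import ipaddress


def decode_hex_endpoint(text):
    """Turn 'c0a80102:0412' into ('192.168.1.2', 1042)."""
    addr_hex, port_hex = text.split(":")
    return str(ipaddress.IPv4Address(int(addr_hex, 16))), int(port_hex, 16)


def leading_tids(payload_hex):
    """Read the first two 32-bit big-endian words of the UDP payload."""
    digits = payload_hex[2:] if payload_hex.startswith("0x") else payload_hex
    first = int(digits[:8], 16)
    second = int(digits[8:16], 16)
    # Format the words the way pvmd prints task ids in its log prefix.
    return "t%x" % first, "t%x" % second


# Strings copied from the trace.
master_arg = decode_hex_endpoint("7f000001:04a9")
slave_reply = decode_hex_endpoint("c0a80102:0412")
tids = leading_tids(
    "0x8008000080040000000100000700000080020009000000010000000000000000"
)
print(master_arg)   # ('127.0.0.1', 1193)
print(slave_reply)  # ('192.168.1.2', 1042)
print(tids)         # ('t80080000', 't80040000')
```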

more NFS conversation for slave log entry "pam_pwdb rsh closed" and more closing:

log

slave.shell -> master.1023 FIN ACK
master.1023 -> slave.shell ACK

more NFS conversation for continuation of log entry and more closing:

master.1022 -> slave.1023 FIN ACK

more NFS conversation for continuation of log entry above and new slave log entries
"[t80080000] slave2 (192.168.1.2:1036) LINUX 3.3.11", "[t80080000] ready Thu Mar 11 19:50:53 1999"
and more closing:

slave.1023 -> master.1022 ACK

more NFS conversation for continuation of above log entries and new slave log entry
"[t80080000] netinput() recvfrom(netsock): Connection refused":

localtime ../usr/share/zoneinfo/US/Eastern .. share zoneinfo US Eastern

after which the sequence beginning with the UDP packet repeats at longer and longer
time intervals until the bailout occurs.

As I note above, the slave never responded to the master's UDP packet. I suspect
that the "Connection refused" error is the cause of that omission, but I am not
sure; since the error doesn't make sense to me with UDP, it's a little hard to
figure out. Since this problem does not occur on a cluster which is NOT running
NFS, I have a hunch that the problem is caused by a confusion of sockets due to
the increased UDP traffic. It is worth noting that the machines in the headless
cluster are 400 MHz (100 MHz bus), while those in the normal cluster (which
works) are 333 MHz (66 MHz bus).

If anyone has any ideas, previous experience or fixes (!) relevant to this problem,
please email me directly at koehlekr@ucrwcu.rwc.uc.edu or kenneth.koehler@uc.edu.

Thanks in advance for your time (and patience in reading this!).

Ken

Dr. Kenneth R. Koehler
Associate Professor of Physics
Dept. of Mathematics, Physics and Computer Science
Raymond Walters College
University of Cincinnati
