Some gathered information on my ongoing NFS problems.
My test setup involves two machines:
cy:
6x86 Cyrix P150+
32M memory
1GB Fast-Wide SCSI drive
Adaptec 2940UW SCSI controller
NetGear 100Base adapter (Tulip 21140 chipset)
amd:
5x86 AMD 133Mhz.
24M memory
500MB Fast SCSI2 drive
NCR/Symbios clone SCSI controller
NetGear 100Base adapter
BOTH:
kernel 2.1.55 w/ NFS client & server compiled in
Donald Becker's latest beta Tulip driver (0.79)
machines are directly connected with a crossover cable.
libc 5.4.33
binutils 2.8.1.0.1
ld-linux.so.1.9.2
gcc 2.7.2 (w/ specs modified for -fno-strength-reduce)
SETUP:
Kernel sources for 2.1.55 are on cy in /usr/src/linux-2.1.55
All test builds are performed on amd
***************************************************************
Datapoint 1:
To eliminate autofs as a perturbing factor, cy:/ is manually mounted
on amd as /net/cy. A symlink named "linux" is created in amd's
/usr/src directory pointing at /net/cy/usr/src/linux-2.1.55.
Starting with a source tree that has undergone "make mrproper"
(executed on the server, cy), I do
make dep ; make compressed
on amd.
This build will proceed to completion, and produces a working kernel.
***************************************************************
Datapoint 2:
Eliminate the symlink indirection by making
/net/cy/usr/src/linux-2.1.55 the current working directory on the
client (amd)..
Now the fun begins!
First, running "make clean" fails when attempting to execute:
rm -f `find modules/ -type f -print`
The error message is from "find". It cheerfully lists all the
symlinks in the modules subdirectory one to a line, each followed by a
complaint that the file does not exist. The directory appears empty
when ls is run manually. Moving to the server, we find a directory
where all these entries exist as dangling symlinks (the target
files were removed in an earlier cleanup step).
To help the process along, I remove the bad links manually at the server
console. A re-run of "make clean" then succeeds.
Next, "make dep" is run on the client machine (amd). This completes
normally.
Finally, "make compressed". The actual build fails 100% of the time.
It will chug along for a while, then die with a message of:
foo.o: No space left on device
(standard input): Assembler messages:
(standard input):3186: FATAL: Can't write foo.o: No such file or directory
make[2]: ** [foo.o] Error 1
etc, etc..
Where the "foo" is not deterministic (and naturally the server
isn't really out of space).
A quick ls of the build directory from the client shows that about
half of the subdirectories under the root of the source tree are
inaccessible; displaying with a message of "Stale NFS file handle".
If I umount and re-mount the server, they spring back into existence.
However a restarted build will eventually fail again with the same
symptoms.
I've yet to have one complete.
******************************************************************
Datapoint 3:
Reboot cy under 2.0.31-pre9 with user-space NFS server.
Absolutely zero problems. I can build and rebuild to my heart's
content.
*******************************************************************
Datapoint 4:
Run _both_ machines under 2.0.31-pre9 with user-space nfs and old
client respectively.
Absolutely no problems, but noticeably slower.
********************************************************************
Datapoint 5:
The server is running 2.1.55, and the client (amd) is running
2.0.3x with the original NFS client.
With the server mounted as /net/cy, I cd to /net/cy/tmp and run the
iozone benchmark on "auto". It blows up 100% of the time (at random
points), producing an endless cycle of these messages at the server:
Sep 12 20:23:20 cy kernel: nfs: RPC call returned error 111
Sep 12 20:23:20 cy kernel: RPC: task of released request still queued!
Sep 12 20:23:20 cy kernel: RPC: (task is on xprt_pending)
At the client, I see:
Error writing block nnn
iozone: Connection refused
interspersed with echoed server complaints.
At this point, things are so fouled up that both boxes must be
rebooted to restore operation.
**********************************************************************
Datapoint 6:
As above, but with the server running the old server software.
Nary a problem...
**********************************************************************
Conclusions:
By elimination, I would conclude that something is broken in the
2.1.55 kernel-based NFS server.
For the umpteenth time:
This stuff is 100% repeatable. If there are any debugging hooks you would
like me to put in, or any patches to try, please say the word!
In the hope that this is of help.
Steve