NFS client hangs under certain circumstances on SMP machine

From: Olivier Croquette
Date: Tue Feb 28 2006 - 15:33:58 EST


Hi

I have already sent this message on the NFS mailing-list, but I had no reaction there. May be you kernel hackers have an idea?



I have a strange problem since a few months on some Linux clients.

I have a file server accessed through:
- NFS from Linux clients (autofs, but direct mount causes same effect)
- Samba from Windows clients

This works since several years like a charm, but as I said there is a
strange problem that appeared recently:

I have a directory, to which I generate code from Windows (\\server\dir)
I can see it under Linux (/mount/dir) where I can access (compile) the
files.

However, when I regenerate the file under Windows again (ie. I overwrite
the old files), and I try to compile the files again under Linux, "make"
hangs simply in D state:

# ps aux | grep make
user 7177 0.0 0.0 1984 760 pts/1 D+ 16:13 0:00 make -f myMakefile

The load average goes up one unit each time I reproduce this test (apparently, processes in non-interruptible state are considered as running).

From then, the following actions does NOT unblock the process:
- stopping or restarting the NFS service on the server
- restarting the server
- restarting autofs on the client
- trying to unmount the NFS mount

If I reboot the client, all goes back to normal, until I repeat the
process below (ie. overwriting and compiling).
Typically, "shutdown -r" does not work, I have to "reboot -f".

There is nothing interesting in /var/log on the server nor on the
client.

Versions used on the server:
- SuSE 9.3
- kernel-default-2.6.11.4-21.11
- nfs-utils-1.0.7-3
- samba-3.0.13-1.1
- filesystem: reiserfs

On the client:
- SuSE 9.3
- kernel-smp-2.6.11.4-21.10
- nfs-utils-1.0.7-3
- mounts:
automount on /mount type autofs (rw,fd=4,pgrp=6529,minproto=2,maxproto=4)
serv:/dir on /mount/dir type nfs (rw,addr=*IP*)
- CPU: P4 with hyper threading (2 virtual CPUs)
Note: maxcpus=0 does not make any difference regarding this issue. I
could not test yet with kernel compiled without SMP at all.


On the following clients with the very same server, network, and mount tables I could not reproduce the problem:

- SuSE 9.1
- kernel-default-2.6.5-7.202.7
- nfs-utils-1.0.6-103
- CPU: P4 single core

- SuSE 10.0
- Kernel: 2.6.14.3-default (from kernel.org)
- nfs-utils-1.0.7-13


Any idea?
Seems to me as it is related to the SMP. What do you think?
How can I debug further?


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/