kernel bug related to dumping core over NFS? (machine hangs)

Zubin D. Dittia (zubin@growthnetworks.com)
Tue, 05 Oct 1999 02:47:56 -0700


Hi,
This is my first posting to this list, not sure if it is the right forum to
bring this up... I believe there is either a bug or something very wrong
with the way linux dumps a program's core file over NFS. Here is
a description of the problem.

I have been using Linux as a desktop system which I also use to run
heavy duty compute and memory intensive simulations. The machine
I have is a Dell PC which came pre-installed with Linux. It has two
processors, so it uses the SMP kernel (version 2.2.5-15smp). The PC
has 1 GB of memory. However, it has a small disk, so I have NFS mounted
most of my files from a Solaris machine which is the file server. The
machines are connected by 100 Mb/sEthernet.

The simulations I run have a very large memory image, usually
several hundred megabytes. The problem occurs when my program
seg faults and dumps core over NFS to the remote disk. In this case,
I notice that the entire machine hangs until the core dump completes,
and the core dump takes a VERY long time to finish: often an hour
or more!!! This effectively means that the machine is unusable for
anything during this period.

I decided to pay closer attention to what was going on during the time
it spends dumping core. The first thing I noticed (by looking at the
LEDs on my ethernet hub) that the core dump would proceed in
bursts: data would be transferred for a few seconds, and then the
network would go idle and no more data would be transferred for
a few minutes (usually 3 to 5 minutes), and then it would start up
again and transfer another chunk of data. The chunks transferred
varied in size from ~2MB to ~90MB (I could find that out by
monitoring the size of the core file on the server). Obviously, most
of the time was spent in the idle periods between these bursts, during
which nothing seemed to be happening.

During the idle periods, I could ping the linux machine, but I could
not do anything else (for example, telnet would connect but would not
give me a login prompt, and X-windows was stuck, the mouse would
not move). After the dump completed, the machine was back to
normal.

After the dump had completed and my machine came alive again,
I went in and did a dmesg. The output was something like the
following:

nfs: server kauai OK
nfs: server kauai OK
nfs: server kauai OK
nfs: server kauai OK
nfs: server kauai OK
nfs: server kauai OK
nfs: server kauai OK
nfs: task 26069 can't get a request slot
nfs: task 26070 can't get a request slot
nfs: task 26071 can't get a request slot
nfs: task 26018 can't get a request slot
nfs: task 25924 can't get a request slot
nfs: task 26073 can't get a request slot
nfs: task 26074 can't get a request slot
nfs: task 26075 can't get a request slot

and this kind of output kept repeating over and over.

Does anyone have an idea of what is going on, and what I can do to fix this?
Many thanks,
-Zubin

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/