NFS got stuck on 2.3.22

Simon Kirby (sim@stormix.com)
Thu, 21 Oct 1999 12:37:06 -0400


Running 2.3.22 plus the array.c fixes from the pre-patches on a dual CPU
SMP machine, I was fiddling around with recording CD audio on a remote
machine via SSH and playing them with a wav player via NFS on my local
machine (that has a sound card). Everything was working up until when I
pressed ctrl-C on the wav player at one point, and it stopped the sound
but the process got stuck.

I went to another window and found I couldn't access the NFS mount point
any more, either.

Here is my `mount` output:

/dev/hda1 on / type ext2 (rw,noatime,nodiratime,nocheck,nocheck)
proc on /proc type proc (rw)
/dev/hda2 on /d1 type ext2 (rw,noatime,nodiratime,nocheck)
/dev/hdb3 on /d2 type ext2 (rw,noatime,nodiratime,nocheck)
foo:/ on /foo type nfs (rw,soft,intr,nolock,timeo=8,retrans=8,rsize=4096,wsize=4096,addr=192.168.143.1)
foo:/d1 on /foo/d1 type nfs (rw,soft,intr,nolock,timeo=8,retrans=8,rsize=4096,wsize=4096,addr=192.168.143.1)

I have /foo/d1 mounted over top of /foo because I was told a while ago
that I have to mount a remote system's different mount points also
through NFS to avoid inode hashing clashes.

I fiddled around trying to get it to do something again with no luck. I
looked in /proc and found some debugging variables in /proc/sys/sunrpc,
so I did "echo 255 > x" where x is "nfs_debug" and "rpc_debug".
rpc_debug gave me this output:

-pid- proc flgs status -client- -prog- --rqstp- -timeout -rpcwait -action- --exit--
35073 0006 0207 004196 c7c63280 100003 c7be215c 00000000 schedq c01a65e0 c0162bbc
35072 0006 0207 004196 c7c63280 100003 c7be24fc 00000000 schedq c01a65e0 c0162bbc
35071 0006 0207 004196 c7c63280 100003 c7be232c 00000000 schedq c01a65e0 c0162bbc
35070 0006 0207 004196 c7c63280 100003 c7be2074 00000000 schedq c01a65e0 c0162bbc
35069 0006 0207 004196 c7c63280 100003 c7be2b54 00000000 schedq c01a65e0 c0162bbc
35068 0006 0207 004196 c7c63280 100003 c7be2d24 00000000 schedq c01a65e0 c0162bbc
35067 0006 0207 004196 c7c63280 100003 c7be27b4 00000000 schedq c01a65e0 c0162bbc
35066 0006 0207 004196 c7c63280 100003 c7be289c 00000000 schedq c01a65e0 c0162bbc
35065 0006 0207 004196 c7c63280 100003 c7be2a6c 00000000 schedq c01a65e0 c0162bbc
35064 0006 0207 004196 c7c63280 100003 c7be2e0c 00000000 schedq c01a65e0 c0162bbc
35063 0006 0207 004196 c7c63280 100003 c7be2c3c 00000000 schedq c01a65e0 c0162bbc
35062 0006 0207 004196 c7c63280 100003 c7be26cc 00000000 schedq c01a65e0 c0162bbc
35061 0006 0207 004196 c7c63280 100003 c7be2414 00000000 schedq c01a65e0 c0162bbc
35060 0006 0207 004196 c7c63280 100003 c7be25e4 00000000 schedq c01a65e0 c0162bbc
35059 0006 0207 004196 c7c63280 100003 c7be2244 00000000 schedq c01a65e0 c0162bbc
35058 0006 0203 004096 c7c63280 100003 c7be2984 00000000 <NULL> 0 c0162bbc

I tried to do a "ls -ld /foo" and "ls -ld /foo/d1". /foo/d1 did not
work, but /foo did (just fine). I can also walk around the filesystem in
/foo as long as I don't touch /foo/d1. Here is the output on my console
when typing "ls -ld /foo":

NFS: revalidating ///, ino=1879048194
NFS call getattr
RPC: 35704 new task procpid 21165
RPC: 35704 looking up UNIX cred
RPC: gc'ing RPC credentials for auth c7c8f660
RPC: 35704 rpc_execute flgs 1
RPC: 35704 call_reserve
RPC: 35704 xprt_reserve cong = 0 cwnd = 640
RPC: 35704 reserved req c7bec074 xid cf7ebb47
RPC: wake_up_next(c7bec060 "xprt_backlog")
RPC: 35704 xprt_reserve returns 0
RPC: 35704 call_reserveresult (status 0)
RPC: 35704 call_allocate (status 0)
RPC: allocated buffer c3c95000
RPC: 35704 call_encode (status 0)
RPC: 35704 marshaling UNIX cred c079b060
RPC: 35704 call_transmit (status 0)
RPC: 35704 xprt_transmit(cf7ebb47)
RPC: 35704 added to queue c7bec058 "xprt_pending"
RPC: xprt_sendmsg(100) = 100
RPC: 35704 xmit complete
RPC: udp_data_ready...
RPC: udp_data_ready client c7bec000
RPC: 35704 received reply
RPC: 35704 has input (96 bytes)
RPC: wake_up_next(c7bec050 "xprt_sending")
RPC: 35704 call_receive (status 96)
RPC: 35704 xprt_receive
RPC: 35704 removed from queue c7bec058 "xprt_pending"
RPC: 35704 xprt_receive returns 96
RPC: 35704 call_status (status 96)
RPC: 35704 call_decode (status 96)
RPC: 35704 validating UNIX cred c079b060
RPC: attrstat status 0
RPC: attrstat OK type 2 mode 40755 dev 1 ino 70000002
RPC: 35704 call_decode result 0
RPC: 35704 exit() = 0
RPC: 35704 release task
RPC: 35704 deleting timer
RPC: 35704 release request c7bec074
RPC: wake_up_next(c7bec060 "xprt_backlog")
RPC: 35704 releasing UNIX cred c079b060
RPC: rpc_release_client(c7c63140, 1)
NFS reply getattr
NFS: refresh_inode(2/1879048194 ct=1)
NFS: /// revalidation complete
drwxr-xr-x 38 root root 2048 Oct 12 22:21 ./

Here is when I type "ls -ld /foo/d1"

NFS: revalidating ///, ino=1879082878
NFS call getattr
RPC: 35703 new task procpid 21164
RPC: 35703 looking up UNIX cred
RPC: 35703 rpc_execute flgs 1
RPC: 35703 call_reserve
RPC: 35703 xprt_reserve cong = 4096 cwnd = 4096
RPC: xprt_reserve waiting on backlog
RPC: 35703 sleep_on(queue "xprt_backlog" time 13387421)
RPC: 35703 added to queue c7be2060 "xprt_backlog"
RPC: 35703 setting alarm for 12000 ms
RPC: 35703 xprt_reserve returns 0
RPC: 35703 sync task going to sleep
RPC: 35703 timeout (default timer)
RPC: 35703 __rpc_wake_up (now 13388622 inh 0)
RPC: 35703 removed from queue c7be2060 "xprt_backlog"
RPC: __rpc_wake_up done
RPC: 35703 sync task resuming
RPC: 35703 call_reserveresult (status -110)
RPC: task timed out
RPC: 35703 call_timeout (major timeo)
nfs: server foo not responding, timed out
RPC: 35703 exit() = -5
RPC: 35703 release task
RPC: 35703 releasing UNIX cred c551d7a0
RPC: rpc_release_client(c7c63280, 17)
NFS reply getattr
nfs_revalidate_inode: /// getattr failed, ino=1879082878, error=-5
ls: d1: Input/output error

It seems that the "xprt_reserve cong" is stuck at 4096, seeming like a
queue is full of some sort. I did a "tcpdump" and watched for NFS
packets, but nothing at all gets sent out when accessing /foo/d1.
Packets are transferred with successful responses (according to tcpdump)
when accessing /foo. I waited several minutes to see and nothing ever
got retrasmitted (/foo/d1 remained "clogged").

"ps l" output shows that the wav player (aplay) is stuck in wait_on_page:

0 0 21021 19764 0 0 1084 432 wait_on_pag D 4 0:00 aplay track01.cdda.wav

Is there anything further I can to do figure out what happened here? I
believe this has happened three or four times in the past, but usually it
works fine. I don't know how easy this will be to reproduce, so I'll
leave my machine running in its current state for a little while.

Simon-

[ Stormix Technologies Inc. ][ NetNation Communcations Inc. ]
[ sim@stormix.com ][ sim@netnation.com ]
[ Opinions expressed are not necessarily those of my employers. ]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/