Re: nfs client hang

From: Andy Chittenden
Date: Tue Jul 27 2010 - 03:25:24 EST


On 2010-07-23 13:36, Andy Chittenden wrote:
IE the client starts a connection and then closes it again without sending data.
Once this happens, here's some rpcdebug info for the rpc module using 2.6.34.1 kernel:

... lots of the following nfsv3 WRITE requests:
[ 7670.026741] 57793 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog
[ 7670.026759] 57794 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog
[ 7670.026778] 57795 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog
[ 7670.026797] 57796 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog
[ 7670.026815] 57797 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog
[ 7670.026834] 57798 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog
[ 7670.026853] 57799 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog
[ 7670.026871] 57800 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog
[ 7670.026890] 57801 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog
[ 7670.026909] 57802 0001 -11 ffff88012e32b000 (null) 0 ffffffffa03beb10 nfsv3 WRITE a:call_reserveresult q:xprt_backlog
[ 7680.520042] RPC: worker connecting xprt ffff88013e62d800 via tcp to 10.1.6.102 (port 2049)
[ 7680.520066] RPC: ffff88013e62d800 connect status 99 connected 0 sock state 7
[ 7680.520074] RPC: 33550 __rpc_wake_up_task (now 4296812426)
[ 7680.520079] RPC: 33550 disabling timer
[ 7680.520084] RPC: 33550 removed from queue ffff88013e62db20 "xprt_pending"
[ 7680.520089] RPC: __rpc_wake_up_task done
[ 7680.520094] RPC: 33550 __rpc_execute flags=0x1
[ 7680.520098] RPC: 33550 xprt_connect_status: retrying
[ 7680.520103] RPC: 33550 call_connect_status (status -11)
[ 7680.520108] RPC: 33550 call_transmit (status 0)
[ 7680.520112] RPC: 33550 xprt_prepare_transmit
[ 7680.520118] RPC: 33550 rpc_xdr_encode (status 0)
[ 7680.520123] RPC: 33550 marshaling UNIX cred ffff88012e002300
[ 7680.520130] RPC: 33550 using AUTH_UNIX cred ffff88012e002300 to wrap rpc data
[ 7680.520136] RPC: 33550 xprt_transmit(32920)
[ 7680.520145] RPC: xs_tcp_send_request(32920) = -32
[ 7680.520151] RPC: xs_tcp_state_change client ffff88013e62d800...
[ 7680.520156] RPC: state 7 conn 0 dead 0 zapped 1
I changed that debug to output sk_shutdown too. That has a value of 2 (IE SEND_SHUTDOWN). Looking at tcp_sendmsg(), I see this:

err = -EPIPE;
if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
goto out_err;

which correlates with the trace "xs_tcp_send_request(32920) = -32". So, this looks like a problem in the sockets/tcp layer. The rpc layer issues a shutdown and then reconnects using the same socket. So either sk_shutdown needs zeroing once the shutdown completes or should be zeroed on subsequent connect. The latter sounds safer.

--
Andy, BlueArc Engineering

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/