rsync over ssh on 2.4.2 to 2.2.18

From: Russell King (rmk@arm.linux.org.uk)
Date: Tue Feb 27 2001 - 05:02:11 EST


Hi,

I'm seeing odd behaviour with rsync over ssh between two x86 machines -
one if the is an UP PIII (Katmai) running 2.4.2 (isdn-gw) and the other
is an UP Pentium 75-200 (pilt-gw) running 2.2.15pre13 with some custom
serial driver hacks (for running Amplicon cards with their ISA interrupt-
sharing scheme).

Please note: although I am using 2.2.15pre13, it is _not_ the cause of
this problem, and I don't particularly want to touch the 2.2.15pre13
machine at the present time, which has been happy for the past year or
so.

netstat on isdn-gw shows the following:

        Proto Recv-Q Send-Q Local Address Foreign Address State
        tcp 72868 0 isdn-gw.piltdown.a:1023 pilt-gw.piltdown.at:ssh ESTABLISHED

netstat on pilt-gw shows the following:

        Proto Recv-Q Send-Q Local Address Foreign Address State
        tcp 0 14372 pilt-gw.piltdown.at:ssh isdn-gw.piltdown.a:1023 ESTABLISHED

Looking on the network reveals:

        09:49:23.215852 P pilt-gw.ssh > isdn-gw.1023: . 1119640864:1119640864(0) ack 1175090336 win 31856 <nop,nop,timestamp 114834687 8389267> (DF) [tos 0x8]
        09:49:23.216095 P isdn-gw.1023 > pilt-gw.ssh: . 1:1(0) ack 1 win 0 <nop,nop,timestamp 8401267 114186976> (DF) [tos 0x8]

So isdn-gw is saying it can't accept more data, which ties up with what
netstat is indicating. Note that this situation has persisted for around
12 hours with neither end moving forward.

When I straced the ssh process on isdn-gw this morning, things start moving,
and in this instance everything completed normally, even after I stopped
stracing ssh. However, there are times when I have to leave strace running
to get the rsync to complete.

select(4, [3], [3], NULL, NULL) = 2 (in [3], out [3])
read(3, "\0\0\20\t\215R\241\355\201_o\227\4\371\340+\333_\333\34"..., 8192) = 8192
write(3, "\370\325|\355\232\256\247Kf\202\v\272\334@\231\317]\322"..., 135940) = 24616

On this instance, it appears that sshd has gotten stuck in select().
Maybe there is a race condition or missing wakeup in the TCP code?

Note that this problem also exists on 2.4.0 (I upgraded from 2.4.0 to
2.4.2 yesterday to see if the behaviour has changed).

--
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Wed Feb 28 2001 - 21:00:13 EST