ppp and 2.2.12 hang - anyone else seen this?

rmk@arm.linux.org.uk
Mon, 27 Sep 1999 22:44:12 +0100 (BST)


Hi,

I'm currently using 2.2.12 as a firewall on an ARM machine (naturally) with
32MB, root NFS (served by unfsd), and two serial ports (yes, no display) and
a 10bT network connection.
(see http://www.arm.linux.org.uk/armlinux/ebsa110/desc.html)

Since it has been upgraded to 2.2.12 (I forget what the previous version was,
maybe 2.2.10), it has locked solid a couple of times. I believe that this time
it had an uptime of almost 13 days. Both crashes have been similar in the sense
that the machine appears to lock solid, but only the latest one is described in
detail here:

I normally leave a pppd 2.3.8 running (pppd crtscts lock idle 0 local proxyarp
persist silent 192.168.0.1:192.168.0.9) on one serial port for my laptop, and
the other one is an intermittent pppd connection to the `net.

1. started up the `net pppd connection, which transfered my mail
(used ppp1, pid 32449). This went fine, and terminated normally.
2. shortly after, I started pppd on the laptop, which negociated correctly a
deflate compressed connection

Sep 27 19:06:30 caramon pppd[25193]: Deflate (15) compression enabled
Sep 27 19:06:31 caramon pppd[25193]: found interface eth0 for proxy arp
Sep 27 19:06:31 caramon pppd[25193]: local IP address 192.168.0.1
Sep 27 19:06:31 caramon pppd[25193]: remote IP address 192.168.0.9

3. started a couple of ssh sessions to one of my other machines, and one to
the EBSA110.

Sep 27 19:06:34 flint sshd[15698]: log: Connection from 192.168.0.9 port 1023
(etc)

4. started the `net connection again, which didn't manage to negociate
IPCP before the modem dropped the line.
5. started the `net connection again, managed to negociate IPCP this time,
but again dropped the line. Meanwhile, I was prodding the interface
with ifconfig ppp1 from the laptop so I knew when the link came up.
I don't believe the connection properly established for one reason or
another - ntpdate said "no server suitable for synchronization found"
6. tried again, and this time successful negociation, and everything looked
fine. However, I noticed that there were the following messages every
so often since (3) above:

Sep 27 19:34:18 caramon pppd[25193]: Protocol-Reject for unsupported protocol 0xd
Sep 27 19:34:19 caramon pppd[25193]: Protocol-Reject for unsupported protocol 0x6e6f
Sep 27 19:34:19 caramon pppd[25193]: Protocol-Reject for unsupported protocol 0xd

They are always those protocol numbers, without fail, every single time
in that order. The frequency increased from 5 minutes to every 2 seconds
at one point.

7. I then closed the connection to the EBSA110. This caused the protocol-rejects
to slow down for some reason:

Sep 27 19:35:16 caramon pppd[25193]: Protocol-Reject for unsupported protocol 0xd
Sep 27 19:35:16 caramon pppd[25193]: Protocol-Reject for unsupported protocol 0x6e6f
Sep 27 19:35:16 caramon pppd[25193]: Protocol-Reject for unsupported protocol 0xd
Sep 27 19:35:19 caramon sshd[770]: log: Closing connection to 192.168.0.9
Sep 27 19:35:47 caramon init: no more processes left in this runlevel
Sep 27 19:35:47 caramon pppd[25193]: Protocol-Reject for unsupported protocol 0xd
Sep 27 19:35:47 caramon pppd[25193]: Protocol-Reject for unsupported protocol 0x6e6f

8. Wondering what was going on, I started typing pppdump on an already open
ssh connection from flint to caramon (EBSA110). I only got to 'pppdu' before
a complete freeze, which was at 19:35:47, the last log message.

9. pinging caramon from flint, and from flint to the laptop worked for a while,
but this also stopped responding (about 2 minutes later). From this point on,
it was totally dead to everything.

Unfortunately, the serial console is on the laptop's connection, and I was unable
to grab any of it. If there was an oops, there wasn't anything suitable listening
to record it ;( Try harder next time.

I don't think that this is a method to reproduce this problem, and this problem
may well be specific to my code/setup/whatever, but has anyone else experienced
anything like this with 2.2.12? I don't see anything pppd-related in Alan's
2.2.13ac patches, so maybe the ppp messages are a red herring?

I don't think it's related to the cable - I've used this cable for many hours
before (in the same uptime with the same pppd running) without this complaint.
It just seems a little strange that it should start complaining a lot just
before a major crash about what should be a 100% reliable link.

My only thoughts on this are:

1. Could the interrupt latency be increased for some un-related reason so
the serial port dropped characters?
2. Did something happen (eg, bad data somewhere) which caused an oops in an
interrupt handler, which in turn left in_interrupt() returning non-zero,
and therefore stopping the scheduler. Other parts of the kernel then
tripped over this same bad data, hence the progressive crash?
_____
|_____| ------------------------------------------------- ---+---+-
| | Russell King rmk@arm.linux.org.uk --- ---
| | | | http://www.arm.linux.org.uk/~rmk/aboutme.html / / |
| +-+-+ --- -+-
/ | THE developer of ARM Linux |+| /|\
/ | | | --- |
+-+-+ ------------------------------------------------- /\\\ |

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/