slow decay of GRE tunnel over several hours

David (david@kalifornia.com)
Wed, 23 Dec 1998 14:18:49 -0800


Synopsis: over approximately 6-12 hours, a GRE ipip tunnel decays,
silently failing when transmitting packets.

The tunnel endpoints are two servers, james[207.213.0.47] and
midnight[dynamic].

Descriptions:

james: has two ethernet cards and acts as a simple router for several
devices, including standard /25 and smaller subnets, a vpn over
a GRE tunnel, and a vpn over ssh+pppd.
2.1.131ac11, amdk6-233, 192megs, pgcc-2.91.60

midnight: has one dialup modem and one network card. acts as the router
for a subnet sourced from the GRE tunnel; it previously used the
ssh+pppd solution.
2.1.131ac11, amdk6-233, 128megs, pgcc-2.91.60

eth0 is the network card going to the world
eth1 is the network card for subnets
gre1 is the GRE IPIP tunnel
ppp0 is the ssh+pppd vpn

the following are routed to james:

207.213.0.47
207.213.14.0/24 (lower half tunneled, upper half not used except for endpoints)
207.213.15.0/25,25 (lower half exists as aliases on eth0, upper half
spread on eth1 and vpns)

207.213.15.129 is used for transport endpoints of the tunnels because
that ip is unreachable for hosts not on the router.
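
As a sanity check on the address plan above, here is a minimal sketch (not
part of the original setup) that tests whether an address falls inside a
given prefix. The helper names ip2int and in_subnet are made up for
illustration; only plain shell arithmetic is used.

```shell
#!/bin/bash
# Hypothetical helpers, not from the original scripts.

# ip2int ADDR: print a dotted quad as a 32-bit integer.
ip2int() {
    # Split on dots in a subshell so the caller's IFS is left untouched.
    (
        IFS=.
        set -- $1
        echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    )
}

# in_subnet ADDR NET PREFIX: succeed if ADDR lies inside NET/PREFIX.
in_subnet() {
    mask=$(( (4294967295 << (32 - $3)) & 4294967295 ))
    [ $(( $(ip2int "$1") & $mask )) -eq $(( $(ip2int "$2") & $mask )) ]
}
```

With these, "in_subnet 207.213.14.2 207.213.14.0 25" succeeds (a host in the
tunneled lower half), while "in_subnet 207.213.14.130 207.213.14.0 25" fails,
which matches the gre1 endpoint addresses sitting in the upper half of the
/24.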

[internet] -> james[eth0:207.213.0.47] -> [eth0:subnet:207.213.15.0/25]
-> [eth1]

[eth1:207.213.15.129] -> [eth1:10baseT hub subnets]
-> [eth1:ppp0:207.213.15.241]
-> [gre1:207.213.14.130]

[ppp0:207.213.15.241:207.213.15.243] (ssh works ok, just slow and has
a few peculiarities)

[gre1:207.213.14.130(l):207.213.14.131(r)] carries 14.0/25 subnet

[internet] -> midnight[ppp0:dynamic] -> [eth0:207.213.14.2, 207.213.14.7]
-> [eth0:10baseT subnet:207.213.14.0/25]
-> [gre1]

[gre1:207.213.14.131(l):207.213.14.130(r)] carries 14.0/25 subnet

scripts used to establish GRE vpn:

(midnight is responsible for establishing it)

#!/bin/bash
#
# this script runs on midnight; contacts james.
ip=`ifconfig ppp0|grep inet|cut -c26-41|tr -d "\033"|sed 's/\[0m//g'`
echo "Contacting remote point"

/usr/local/sbin/ip link set gre1 down 2>/dev/null
/usr/local/sbin/ip tunnel del gre1 2>/dev/null
/usr/local/sbin/ip tunnel add gre1 mode gre remote 207.213.15.129 local $ip
/usr/local/sbin/ip addr add 207.213.14.131/31 dev gre1
/usr/local/sbin/ip link set gre1 up

/sbin/ifconfig eth0:2 207.213.14.7
/sbin/route del -net 207.213.14.0 netmask 255.255.255.128 dev eth0 2>/dev/null
/sbin/route add -net 207.213.14.0 netmask 255.255.255.128 dev eth0 2>/dev/null
/sbin/route add default gw 207.213.14.131 dev gre1
ssh 207.213.15.129 "/etc/rc.d/rc.vpn-home $ip" 2>/dev/null
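
The "cut -c26-41" step in the script above depends on exact column positions
in the ifconfig output. A less position-sensitive variant could look like the
following. This is only a sketch: it assumes the net-tools "inet addr:"
format seen in the ifconfig output later in this post, and the function name
ppp_local_addr is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper, not from the original scripts.
# Reads ifconfig-style text on stdin and prints the first "inet addr:" value.
# The capture stops at the first character that is not a digit or a dot, so
# trailing escape sequences or padding are not included.
ppp_local_addr() {
    sed -n 's/.*inet addr:\([0-9.]*\).*/\1/p' | head -n 1
}
```

It would be used as: ip=$(ifconfig ppp0 | ppp_local_addr)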

# cat /etc/rc.d/rc.vpn-home
#!/bin/bash

echo "initializing GRE device on `hostname -f`"
echo "Connecting GRE tunnel to $1"

/usr/local/sbin/ip link set gre1 down 2>/dev/null
/usr/local/sbin/ip tunnel del gre1 2>/dev/null
/usr/local/sbin/ip tunnel add gre1 mode gre remote $1 local 207.213.15.129
/usr/local/sbin/ip addr add 207.213.14.130/31 dev gre1
/usr/local/sbin/ip link set gre1 up
/sbin/route add -net 207.213.14.0 netmask 255.255.255.128 gw 207.213.14.130 dev gre1

echo "link gre1 online."
/sbin/ifconfig gre1|sed 's/00/0/g'

(resulting output)
# rc.vpn
Contacting remote point
initializing GRE device on james.kalifornia.com
Connecting GRE tunnel to 38.29.58.7
link gre1 online.
gre1 Link encap:UNSPEC HWaddr CF-D5-0F-81-0-0-0-FD-0-0-0-0-0-0-0-0
inet addr:207.213.14.130 P-t-P:207.213.14.130 Mask:255.255.255.254
UP POINTOPOINT RUNNING NOARP MTU:1476 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0

Routing:
(james)
# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
207.213.14.130 0.0.0.0 255.255.255.254 U 0 0 0 gre1
207.213.15.176 207.213.15.129 255.255.255.248 UG 0 0 0 eth1
207.213.15.128 207.213.15.129 255.255.255.240 UG 0 0 0 eth1
207.213.15.224 207.213.15.129 255.255.255.240 UG 0 0 0 eth1
207.213.0.32 0.0.0.0 255.255.255.224 U 0 0 0 eth0
207.213.15.192 207.213.15.129 255.255.255.224 UG 0 0 0 eth1
207.213.14.0 207.213.14.130 255.255.255.128 UG 0 0 0 gre1
207.213.15.0 0.0.0.0 255.255.255.128 U 0 0 0 eth0
0.0.0.0 207.213.0.33 0.0.0.0 UG 0 0 0 eth0

(midnight)
# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
38.1.1.1 0.0.0.0 255.255.255.255 UH 0 0 0 ppp0
207.213.15.129 0.0.0.0 255.255.255.255 UH 0 0 0 ppp0
207.213.14.130 0.0.0.0 255.255.255.254 U 0 0 0 gre1
207.213.14.0 0.0.0.0 255.255.255.128 U 0 0 0 eth0
207.213.14.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
0.0.0.0 207.213.14.131 0.0.0.0 UG 0 0 0 gre1
#

Timeline:

reboot of james: everything works fine
several hours later: SMTP connections to 207.213.14.2 cease; ssh still works.
ssh works for a while because it binds to a local
socket before attempting to connect. a telnet to
port 25 results in the following:

connect(3, {sin_family=AF_INET, sin_port=htons(25),
sin_addr=inet_addr("207.213.14.2")}, 16 <unfinished ...>

if i use ssh instead of telnet, the bind happens,
then the connect succeeds. this is ONLY from james
to midnight. connections from midnight will always
succeed throughout this entire process. (this may
simply be the number of packets processed by each
machine)

a few hours later: connect() fails more. ssh now fails. the bind no
longer helps. all connects silently fail.

a few more hours: connect() starts returning out of buffer space
errors.

SysRq: Show Memory
Mem-info:
Free pages: 13072kB
( Free: 3268 (256 512 768)
128*4kB 164*8kB 189*16kB 107*32kB 19*64kB 28*128kB = 13072kB)
Swap cache: add 0/0, delete 0/0, find 0/0
Free swap: 0kB
49152 pages of RAM
1115 reserved pages
13114 pages shared
0 pages swap cached
29 pages in page table cache
Buffer memory: 38580kB
Buffer heads: 38604
Buffer blocks: 38580
CLEAN: 33104 buffers, 41 used (last=475), 0 locked, 0 protected, 0 dirty
LOCKED: 5408 buffers, 76 used (last=5379), 0 locked, 0 protected, 0 dirty
DIRTY: 26 buffers, 0 used (last=0), 0 locked, 0 protected, 26 dirty
Networking buffers in use : 423
Total network buffer allocations : 5355600
Total failed network buffer allocs : 0
IP fragment buffer size : 0
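
To see whether the network buffer counters drift as the tunnel decays, the
three networking lines of a dump like the one above could be pulled out and
logged at intervals. A minimal sketch, assuming the dump text is available on
stdin (for example from the kernel log); the function name netbuf_counters is
hypothetical.

```shell
#!/bin/bash
# Hypothetical helper, not from the original post.
# Reads a "Show Memory" dump on stdin and prints three numbers on one line:
#   buffers-in-use  total-allocations  failed-allocations
netbuf_counters() {
    awk -F':' '
        /Networking buffers in use/          { in_use = $2 + 0 }
        /Total network buffer allocations/   { total  = $2 + 0 }
        /Total failed network buffer allocs/ { failed = $2 + 0 }
        END { print in_use, total, failed }
    '
}
```

Fed the dump above, this prints the values 423, 5355600 and 0. Comparing
successive samples of the first number would show whether buffers in use keep
climbing while the tunnel is up.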

# cat /proc/sys/net/ipv4/ip_local_port_range
1024 16384

tearing down the GRE tunnel and reestablishing it in exactly the proper
order appears to fix things. this procedure is not guaranteed to work, and
several times james has had to be rebooted.
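
The exact order is not spelled out here. As an illustration only, the sketch
below prints the rebuild commands for one end of gre1 in the same order the
scripts earlier in this post use (link down, tunnel del, tunnel add, addr
add, link up). The function name gre_reset_cmds and its argument layout are
hypothetical; it only echoes commands, so the sequence can be reviewed before
anything is run.

```shell
#!/bin/bash
# Hypothetical helper, not from the original scripts.
# gre_reset_cmds REMOTE LOCAL TUNNEL_ADDR
# Prints (does not run) the commands that tear down and rebuild gre1.
# The order is copied from the rc.vpn scripts in this post; it is an
# assumption that this is the order that clears the fault.
IP=/usr/local/sbin/ip

gre_reset_cmds() {
    echo "$IP link set gre1 down"
    echo "$IP tunnel del gre1"
    echo "$IP tunnel add gre1 mode gre remote $1 local $2"
    echo "$IP addr add $3 dev gre1"
    echo "$IP link set gre1 up"
}
```

On midnight this might be called as
gre_reset_cmds 207.213.15.129 "$ip" 207.213.14.131/31
and the output piped to sh once it looks right. Routes would still have to be
re-added afterwards, as in the original scripts.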

as a side note, with 2.1.132pre4 on james, the GRE tunnel would establish
but would not transfer packets in either direction. i will attempt
2.1.132 full (and 133+) this evening. the exact same kernel is being run
on both machines with the only difference being one is using a mylex scsi
card, the other a fireport. both have tulip cards in them with tulip
version .90. if GRE tunneling is not employed, networking remains stable.

please let me know what further information is desired.

-d

-- 
  Look, Windows 98  Buy, lemmings, buy!  MCSE, Must Consult Someone Experienced
__ (c) 1998 David Ford.  Redistribution via the Microsoft Network is prohibited
\/  for linux-kernel: please read linux/Documentation/* before posting problems
