Re: PROBLEM: Can ping address, but traceroute gets ENETDOWN

From: Eric Dumazet
Date: Tue Jul 17 2012 - 09:30:50 EST


On Tue, 2012-07-17 at 09:04 -0400, Terry Phelps wrote:
> I'm seeing what seems to me totally illogical behavior with my IPv4
> networking. Can someone please help me isolate the problem better?
>
> I have at least EIGHT servers with the same symptom. All are running
> Oracle "Unbreakable Enterprise Kernel 2". Oracle numbers this kernel
> 2.6.39.*, but it is "based on the 3.0.16 kernel". I don't know exactly
> what patches might have been applied. The symptom I see is:
>
> I'm SSH'ed into the server from my desk, on another network. All is well.
> Then either (1) SSH freezes, or (2) I exit SSH and can't SSH to it
> again.
> Then I ping the server from my desk. It FAILS.
> I ping the server from a second machine on my desk (same network). It works.
> If I keep pinging from my desktop, where the SSH just failed, it will
> NEVER get a response. I've let it ping for DAYS.
> But if I stop pinging for 5 minutes or so, it'll work just fine again.
> While things are "hosed", I am able to ping and ssh from my second
> desktop to the server just fine.
> If I SSH to the server, it CAN ping my desktop, but it CANNOT traceroute to it.
> If I leave the ping going (and failing), and go to the server and "ip
> route flush cache", the pings start working immediately.
> I can get the problem from other desktops on other networks, but I
> have never seen it from another server on the same network.
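
For reference, on the "ip route flush cache" step: on kernels that still
have the IPv4 routing cache (it was removed in 3.6; the netstat -nrC output
below shows this 2.6.39/3.0-based kernel still has one), iproute2 does that
flush, as far as I know, by writing "-1" to /proc/sys/net/ipv4/route/flush.
A minimal C sketch of that step, assuming that path. flush_route_cache() is
a hypothetical helper; the path is a parameter so it can be tried against an
ordinary file first. On a real system it needs root.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* IPv4 route-cache flush sysctl on pre-3.6 kernels (assumed path). */
#define ROUTE_FLUSH_PATH "/proc/sys/net/ipv4/route/flush"

/* Writes "-1" to the given file, which is what (as an assumption) iproute2
 * does for "ip route flush cache" on IPv4. Returns 0 on success, otherwise
 * the errno of the failing call (EIO for a short write). */
static int flush_route_cache(const char *path)
{
    static const char value[] = "-1";
    assert(path != NULL);

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return errno;

    ssize_t n = write(fd, value, strlen(value));
    int err = 0;
    if (n < 0)
        err = errno;
    else if ((size_t)n < strlen(value))
        err = EIO; /* short write: treat as failure */

    close(fd);
    return err;
}
```

On the server this would be flush_route_cache(ROUTE_FLUSH_PATH), run as root.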
>
> It gets stranger. Here are some commands run on the server, while the
> pings from my desktop are failing. The failing pings are coming from
> 192.168.118.22. The machine right next to that one is .23, and it works
> fine.
>
> I have ONE NIC in the box, and I have no reason to think it isn't
> configured properly.
>
> # ifconfig -a
> eth0      Link encap:Ethernet  HWaddr 00:50:56:9A:00:17
>           inet addr:172.16.2.95  Bcast:172.16.255.255  Mask:255.255.0.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:246266059 errors:0 dropped:85001 overruns:0 frame:0
>           TX packets:290982046 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:70745127855 (65.8 GiB)  TX bytes:27490797799 (25.6 GiB)
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:258548668 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:258548668 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:226377171068 (210.8 GiB)  TX bytes:226377171068 (210.8 GiB)
>
>
> The server can ping my desktop just fine:
>
> # ping 192.168.118.22
> PING 192.168.118.22 (192.168.118.22) 56(84) bytes of data.
> 64 bytes from 192.168.118.22: icmp_seq=1 ttl=127 time=0.827 ms
> 64 bytes from 192.168.118.22: icmp_seq=2 ttl=127 time=0.739 ms
> 64 bytes from 192.168.118.22: icmp_seq=3 ttl=127 time=0.725 ms
>
>
>
> But a traceroute to the same destination says "network is down":
>
> # traceroute 192.168.118.22
> traceroute to 192.168.118.22 (192.168.118.22), 30 hops max, 40 byte packets
> send: Network is down
>
>
>
> A syscall trace of traceroute shows the sendto() call failing with
> ENETDOWN:
>
>
> socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 3
> setsockopt(3, SOL_IP, IP_MTU_DISCOVER, [0], 4) = 0
> setsockopt(3, SOL_SOCKET, SO_TIMESTAMP, [1], 4) = 0
> fcntl(3, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
> setsockopt(3, SOL_IP, IP_TTL, [1], 4) = 0
> setsockopt(3, SOL_IP, IP_RECVERR, [1], 4) = 0
> connect(3, {sa_family=AF_INET, sin_port=htons(33434), sin_addr=inet_addr("192.168.118.22")}, 28) = 0
> sendto(3, "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_"..., 40, 0, NULL, 0) = -1 ENETDOWN (Network is down)
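
To check whether the ENETDOWN follows from this exact syscall sequence rather
than from something traceroute does elsewhere, the trace can be replayed from
a small C program. This is a sketch: replay_probe() is a hypothetical helper
that copies the socket options from the strace above (PMTU discovery off,
SO_TIMESTAMP, non-blocking, TTL 1, IP_RECVERR) and the destination port
33434, then reports the errno. It does not read ICMP errors back the way
traceroute does.

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Replays the first traceroute probe: a connected UDP socket, then one
 * 40-byte sendto() with a NULL destination. Returns 0 if the probe was
 * sent, the errno from connect()/sendto() on failure (ENETDOWN in the
 * report), or -1 if the address or socket could not be set up. */
static int replay_probe(const char *dst, unsigned short port)
{
    int pmtu = 0; /* IP_PMTUDISC_DONT, the [0] in the trace */
    int on = 1, ttl = 1;
    struct sockaddr_in sa;
    char payload[40];

    assert(dst != NULL);
    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_port = htons(port);
    if (inet_pton(AF_INET, dst, &sa.sin_addr) != 1)
        return -1;

    int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_IP);
    if (fd < 0)
        return -1;

    /* Same options, in the same order, as the strace (SOL_IP == IPPROTO_IP);
     * return values are ignored for brevity. */
    setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &pmtu, sizeof pmtu);
    setsockopt(fd, SOL_SOCKET, SO_TIMESTAMP, &on, sizeof on);
    fcntl(fd, F_SETFL, O_NONBLOCK);
    setsockopt(fd, IPPROTO_IP, IP_TTL, &ttl, sizeof ttl);
    setsockopt(fd, IPPROTO_IP, IP_RECVERR, &on, sizeof on);

    if (connect(fd, (struct sockaddr *)&sa, sizeof sa) < 0) {
        int err = errno;
        close(fd);
        return err;
    }

    /* "@ABCDEFG..." payload, like traceroute's probe. */
    for (size_t i = 0; i < sizeof payload; i++)
        payload[i] = (char)('@' + i);

    ssize_t n = sendto(fd, payload, sizeof payload, 0, NULL, 0);
    int err = (n < 0) ? errno : 0;
    close(fd);
    return err;
}
```

Run on the server while the pings are failing, replay_probe("192.168.118.22",
33434) would be expected to return ENETDOWN if it hits the same condition
traceroute does, while the same call for .23 returns 0.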
>
>
>
> Yet traceroute (and ping) to a machine on the same network is fine:
>
> # traceroute 192.168.118.23
> traceroute to 192.168.118.23 (192.168.118.23), 30 hops max, 40 byte packets
> 1 172.16.16.253 (172.16.16.253) 1.304 ms 1.614 ms 1.886 ms
> 2 192.168.118.23 (192.168.118.23) 0.521 ms 0.566 ms 0.562 ms
>
>
>
> I have a default route, and no other routes defined:
>
> # netstat -nr
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
> 0.0.0.0         172.16.0.5      0.0.0.0         UG        0 0          0 eth0
> 169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 eth0
> 172.16.0.0      0.0.0.0         255.255.0.0     U         0 0          0 eth0
>
>
>
> Here are my route cache entries for the network I'm trying to talk to:
>
> # netstat -nrC|grep 192.168.118
> 172.16.2.95     192.168.118.22  172.16.70.101          1500 0      239 eth0
> 192.168.118.23  172.16.2.95     172.16.2.95     l     16436 0        0 lo
> 172.16.2.95     192.168.118.23  172.16.70.101          1500 0        0 eth0
> 192.168.118.22  172.16.2.95     172.16.2.95     l     16436 0        0 lo
> 172.16.2.95     192.168.118.22  172.16.70.101          1500 0      239 eth0
> 172.16.2.95     192.168.118.23  172.16.70.101          1500 0        0 eth0
> 172.16.2.95     192.168.118.22  172.16.70.101          1500 0      239 eth0
> 172.16.2.95     192.168.118.23  172.16.70.101          1500 0        0 eth0
> 172.16.2.95     192.168.118.23  172.16.70.101          1500 0        0 eth0
>
>
>
> And finally, tcpdump shows that the pings from my desktop ARE
> arriving. They are simply not being replied to:
>
> # tcpdump -np host 192.168.118.22
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
> 10:20:48.950240 IP 192.168.118.22 > 172.16.2.95: ICMP echo request, id 2, seq 35155, length 40
> 10:20:54.956584 IP 192.168.118.22 > 172.16.2.95: ICMP echo request, id 2, seq 35158, length 40
> 10:21:00.959048 IP 192.168.118.22 > 172.16.2.95: ICMP echo request, id 2, seq 35161, length 40
> 10:21:06.964326 IP 192.168.118.22 > 172.16.2.95: ICMP echo request, id 2, seq 35164, length 40
>
>
> If you could PLEASE advise me on where to go from here, I would
> greatly appreciate it. I can't imagine what would cause these
> symptoms.
>
> Here is the ver_linux output:
>
> Linux jidlam01.acbl.net 2.6.39-200.29.1.el5uek #1 SMP Fri Jul 6 08:01:33 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
>
> Gnu C 4.1.2
> Gnu make 3.81
> binutils 2.17.50.0.6
> 8.3
> util-linux 2.13-pre7
> mount 2.13-pre7
> module-init-tools 3.3-pre2
> e2fsprogs 1.39
> pcmciautils 014
> quota-tools 3.13.
> PPP 2.4.4
> Linux C Library 2.5
> Dynamic linker (ldd) 2.5
> Procps 3.2.7
> Net-tools 1.60
> Kbd 1.12
> Sh-utils 5.97
> udev 095
> wireless-tools 28
> Modules Loaded autofs4 hidp rfcomm bluetooth rfkill lockd
> sunrpc be2iscsi iscsi_boot_sysfs ib_iser rdma_cm ib_cm iw_cm ib_sa
> ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i libcxgbi
> cxgb3 mdio libiscsi_tcp libiscsi scsi_transport_iscsi video sbs sbshc
> hed acpi_memhotplug acpi_ipmi ipmi_msghandler lp sg sr_mod cdrom
> snd_seq_dummy serio_raw e1000 vmw_balloon snd_seq_oss
> snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss
> snd_pcm snd_timer snd soundcore snd_page_alloc pcspkr parport_pc
> i2c_piix4 i2c_core parport floppy pata_acpi ata_generic dm_snapshot
> dm_zero dm_mirror dm_region_hash dm_log dm_mod ata_piix shpchp mptspi
> mptscsih mptbase scsi_transport_spi sd_mod crc_t10dif ext3 jbd mbcache
>
>
> Terry Phelps
> American Commercial Lines
> Jeffersonville, IN

Hi

This looks like a firewall issue; check:

iptables -nvL
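
Note that "iptables -nvL" lists only the filter table; rules in other tables
(nat, mangle, raw) need "-t". On Linux the loaded IPv4 tables are listed, one
per line, in /proc/net/ip_tables_names (typically readable only by root), so
a small C sketch can print one listing command per loaded table.
list_iptables_tables() is a hypothetical helper; the path is a parameter so
it can be exercised against a test file.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Loaded IPv4 iptables tables, one name per line. */
#define IPT_NAMES_PATH "/proc/net/ip_tables_names"

/* Prints "iptables -t <table> -nvL" for each table named in the file and
 * returns how many were printed, or -1 if the file cannot be opened
 * (for example when ip_tables is not loaded at all). */
static int list_iptables_tables(const char *path, FILE *out)
{
    assert(path != NULL && out != NULL);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    char name[64];
    int count = 0;
    while (fgets(name, sizeof name, f)) {
        name[strcspn(name, "\n")] = '\0'; /* strip trailing newline */
        if (name[0] == '\0')
            continue;
        fprintf(out, "iptables -t %s -nvL\n", name);
        count++;
    }
    fclose(f);
    return count;
}
```

On the server, list_iptables_tables(IPT_NAMES_PATH, stdout) as root prints
the commands to run for every table that could hold a rule.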


