RE: Consistent kernel hang during heavy TCP connection handling load

From: Andrew A.
Date: Thu Sep 23 2004 - 19:20:06 EST



I would not normally quote an an entire message, but it contains data relevant to this problem.

The hang below occurs even outside of GDB, and also occurs after upgrading the kernel:

Linux bbox.memeplex.com 2.6.8-1.521 #1 Mon Aug 16 09:01:18 EDT 2004 i686 i686 i386 GNU/Linux



Can anyone please give me a clue/pointer to tools/techniques that might help identify where in the kernel the hang occurs? The
system is so completely unresponsive when this occurs that I cannot provide any forensic data.

Does anyone's experience show that these types of hangs might occur purely as the result of use (or mis-use) of the pthreads
library? I'm looking for hints about what parts of my code to review.

There could easily be erroneous calls to pthread_detach(), pthread_join(), close(), and other system calls involved.

Thanks,
Andrew Athan



-----Original Message-----
From: linux-kernel-owner@xxxxxxxxxxxxxxx
[mailto:linux-kernel-owner@xxxxxxxxxxxxxxx]On Behalf Of Andrew A.
Sent: Wednesday, September 22, 2004 3:17 PM
To: linux-kernel@xxxxxxxxxxxxxxx
Subject: Consisten kernel hang during heavy TCP connection handling load




I have written a pair of applications the server side of which consistently causes my Linux Fedora Core 2 system to become
completely unresponsive; all consoles hang, and it no longer services network connections.

The applications engage in the rapid opening and closing of TCP connections. The server side is multithreaded (# threads approx 5).
It services the connections by dumping data into them from a file. The client side reads no data. The server then receives EAGAIN
from send(...,MSG_NOWAIT) calls, and issues 5ms sleep before resending on any particular TCP connection. It loops up to 20 times
waiting for the the connection to become unblocked. The applications are running within GDB, and threads *are* created/destroyed
during the process.

I will change the application to use select() rather than sleeping on a blocked pipe. However, I don't think it's a "good thing"
that the machine hangs so completely.

I looked for tools to help catch the kernel before it goes la-la (assuming it's the kernel going la-la), but got
frustrated/ran-out-of-time. E.g., lkcd seems defunct.

If pointed in the right direction, I would be happy to perform further forensics after re-creating the hang. I am also in the
process of upgrading the kernel to see if that resolves the problem.

Andrew Athan




uname -a:

Linux bbox.memeplex.com 2.6.6-1.435 #1 Mon Jun 14 09:09:07 EDT 2004 i686 i686 i386 GNU/Linux

lsmod:

Module Size Used by
snd_mixer_oss 13824 2
snd_via82xx 20644 3
snd_ac97_codec 54788 1 snd_via82xx
snd_pcm 69256 1 snd_via82xx
snd_timer 17284 1 snd_pcm
snd_page_alloc 8072 2 snd_via82xx,snd_pcm
gameport 3328 1 snd_via82xx
snd_mpu401_uart 4864 1 snd_via82xx
snd_rawmidi 17444 1 snd_mpu401_uart
snd_seq_device 6152 1 snd_rawmidi
snd 39396 10
snd_mixer_oss,snd_via82xx,snd_ac97_codec,snd_pcm,snd_timer,snd_mpu401_uart,snd_rawmidi,snd_seq_device
soundcore 6112 3 snd
ipt_mark 1408 2
ipt_MARK 1664 14
cls_u32 5508 2
cls_fw 3200 2
sch_sfq 4352 9
sch_htb 18048 1
iptable_mangle 2176 1
ip_tables 13568 3 ipt_mark,ipt_MARK,iptable_mangle
nfsd 159488 9
exportfs 4224 1 nfsd
lockd 47816 2 nfsd
parport_pc 19392 1
lp 8236 0
parport 29640 2 parport_pc,lp
autofs4 12932 0
sunrpc 109924 19 nfsd,lockd
via_rhine 15752 0
mii 3584 1 via_rhine
floppy 47440 0
sg 27680 0
scsi_mod 91984 1 sg
microcode 4768 0
dm_mod 32800 0
ehci_hcd 22916 0
uhci_hcd 24472 0
button 4632 0
battery 6924 0
asus_acpi 8984 0
ac 3340 0
r128 85796 2
ipv6 184672 18
ext3 103656 2
jbd 40728 1 ext3


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/