Re: [2.6.21.1] totally hanging altough not crashed

From: Folkert van Heusden
Date: Fri Jul 20 2007 - 17:01:50 EST


> > One of my systems running 2.6.21.1 on a P4 with HT and 2GB of ram
> > occasionally crashes. Not with an oops or panic, it just suddenly stops
> > doing anything. It still lets you ping and it still forwards ip traffic
> > but you can't login: not via ssh and not on the console - pressing enter
> > brings the cursor to the next line and that's it. It is NOT swapping at
> > all, in fact alt+sysrq+m says that there's still plenty of memory and
> > swap available:
>
> Lots of processes are in uninterruptible wait at
> start_this_handle+0x208/0x365. Can you find out what line of code
> that matches? (There are a few of those waits in that function.)

I did some more investigating and found:

folkert@muur:~$ grep -A 1 "Call Trace" www/ast.txt | grep -v -e "^--" -v -e "Call Trace:" | cut -d " " -f 2- | genstats | more
1 358 54.82% 1.75 [<c120f326>] schedule_timeout+0x8c/0x8e
2 88 13.48% 7.42 [<c10c23a0>] start_this_handle+0x208/0x365
3 63 9.65% 10.10 [<c120f2e0>] schedule_timeout+0x46/0x8e
4 36 5.51% 17.69 [<c1020868>] do_wait+0x2c4/0x396
5 36 5.51% 16.81 [<c1098cbd>] inotify_read+0x8d/0x1b5
6 17 2.60% 37.59 [<c1074f52>] pipe_wait+0x8a/0xab
7 13 1.99% 2.69 [<c102d22e>] worker_thread+0x130/0x165
8 11 1.68% 55.73 [<c1210307>] do_nanosleep+0x42/0x70
9 7 1.07% 92.29 [<c120f5e9>] __mutex_lock_slowpath+0xac/0x28c
10 7 1.07% 90.71 [<c101f9fc>] do_exit+0x253/0x428
11 2 0.31% 145.50 [<c1138101>] write_chan+0x15b/0x1d6
12 2 0.31% 3.50 [<c104de2a>] watchdog+0x47/0x55
13 2 0.31% 3.00 [<c10224f6>] ksoftirqd+0x85/0x98
14 2 0.31% 2.50 [<c1018e44>] migration_thread+0x8a/0x10f
15 1 0.15% 617.00 [<c10c7d56>] log_wait_commit+0xb5/0x121
16 1 0.15% 562.00 [<c1029e46>] sys_pause+0x14/0x1b
17 1 0.15% 225.00 [<f8a58ae1>] schluffen+0xad/0xaf [zaptel]
18 1 0.15% 63.00 [<c1029efc>] sys_rt_sigsuspend+0xaf/0xcf
19 1 0.15% 23.00 [<c10c75ff>] kjournald+0x1fd/0x207
20 1 0.15% 22.00 [<c10c4b6f>] journal_commit_transaction+0x242/0xcd3
21 1 0.15% 17.00 [<c105ab1b>] kswapd+0xf7/0x10b
22 1 0.15% 16.00 [<c1191647>] serio_thread+0xfa/0xff
23 1 0.15% 15.00 [<c11730e7>] hub_thread+0xe6/0xe8

(please ignore the fourth column)

As I'm not entirely sure that this an innocent to be in for a kernel I
put this point into gdb:

(gdb) p schedule_timeout
$1 = {long int (long int)} 0xc121a8b4 <schedule_timeout>
(gdb) l *0xC120F3B2
0xc120f3b2 is in __xfrm_state_insert (net/xfrm/xfrm_hash.h:49).
44 static inline unsigned __xfrm_src_hash(xfrm_address_t *daddr,
45 xfrm_address_t *saddr,
46 unsigned short family,
47 unsigned int hmask)
48 {
49 unsigned int h = family;
50 switch (family) {
51 case AF_INET:
52 h ^= __xfrm4_daddr_saddr_hash(daddr, saddr);
53 break;

I very much hope it is of any help.


Folkert van Heusden

--
www.vanheusden.com/multitail - win een vlaai van multivlaai! zorg
ervoor dat multitail opgenomen wordt in Fedora Core, AIX, Solaris of
HP/UX en win een vlaai naar keuze
----------------------------------------------------------------------
Phone: +31-6-41278122, PGP-key: 1F28D8AE, www.vanheusden.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/