Re: Processes stuck in unkillable D state (now seen in 2.6.7-mm6)

From: Rob Mueller
Date: Mon Jul 12 2004 - 14:54:56 EST



Things will be much easier for you if you configure a serial or network
console.

It's just crud on the stack, you're really waiting in io_schedule() for
a page to get unlocked. Why isn't the page unlocking? Hard to say for
sure without seeing the whole sysrq-t. If the network/serial console
doesn't work out, I can help you configure lkcd as well.

Well, I tried compiling in the network console, but it seems to be way too buggy. Basically the machine would crash (hard lockup) within about 12-24 hours after booting, nothing on the network console itself or in any log file. Not much help there.

Anyway, after rebooting back into a non-netconsole enabled kernel, we did get another stuck process. This time there was only 1, and I was able to shutdown all the other processes, so that there were only about 50 procs running when I did the sysreq-t command, so I should have been able to capture all the output this time??? I've put the dumps here:

http://robm.fastmail.fm/kernel/t2/

Here's the relevant stuck proc.

imapd D E17BE6E0 0 3761 1 10291 (NOTLB)
e11c3bc8 00000086 00000020 e17be6e0 c1372d20 00000246 00000220 f7e12380
00000020 c0136667 c42c6da0 00000001 00000d74 bbfe8a6a 0000040d c42c6da0
f7f91140 e17be6e0 e17be890 f78cd9cc 00000003 f78cd9cc f78cd9cc c025d2cc
Call Trace:
[<c0136667>] kmem_cache_alloc+0x57/0x70
[<c025d2cc>] generic_unplug_device+0x2c/0x40
[<c037a148>] io_schedule+0x28/0x40
[<c012e03c>] __lock_page+0xbc/0xe0
[<c012dd70>] page_wake_function+0x0/0x50
[<c012dd70>] page_wake_function+0x0/0x50
[<c012f061>] filemap_nopage+0x231/0x360
[<c013dc18>] do_no_page+0xb8/0x3a0
[<c013ba7b>] pte_alloc_map+0xdb/0xf0
[<c013e0ae>] handle_mm_fault+0xbe/0x1a0
[<c025d292>] __generic_unplug_device+0x32/0x40
[<c0112af2>] do_page_fault+0x172/0x5ec
[<c014cab0>] bh_wake_function+0x0/0x40
[<c014cab0>] bh_wake_function+0x0/0x40
[<c018ec9f>] reiserfs_prepare_file_region_for_write+0x94f/0x9b0
[<c0112980>] do_page_fault+0x0/0x5ec
[<c0104b19>] error_code+0x2d/0x38
[<c018dc0f>] reiserfs_copy_from_user_to_file_region+0x8f/0x100
[<c018f2b1>] reiserfs_file_write+0x5b1/0x750
[<c0186675>] reiserfs_link+0xb5/0x190
[<c0186719>] reiserfs_link+0x159/0x190
[<c016134c>] dput+0x1c/0x1b0
[<c016134c>] dput+0x1c/0x1b0
[<c01581a0>] path_release+0x10/0x40
[<c015a9bc>] sys_link+0xcc/0xe0
[<c014bb9a>] vfs_write+0xaa/0xe0
[<c014b610>] default_llseek+0x0/0x110
[<c014bc4f>] sys_write+0x2f/0x50
[<c010406b>] syscall_call+0x7/0xb

Is that in lock_page again?

Hopefully there's some helpful information there. If the dump there isn't complete, can you give me an idea why it might not be? I've set the kernel buffer to 17 (128k), and the proc list was definitely small enough to fit in the buffer. When I did "dmesg -s 1000000 > foo", the first part of the file was still the original boot sequence. Any other suggestions on what to do?

Rob

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/