Re: Fw: crash on x86_64 - mm related?

From: Kai Makisara
Date: Thu Dec 01 2005 - 14:18:01 EST


On Tue, 29 Nov 2005, Andrew Morton wrote:

>
>
> Begin forwarded message:
>
> Date: Tue, 29 Nov 2005 10:44:09 -0500
> From: Ryan Richter <ryan@xxxxxxxxxxxxxxxxxxxxx>
> To: linux-kernel@xxxxxxxxxxxxxxx
> Cc: ryan@xxxxxxxxxxxxxxxxxxxxx
> Subject: crash on x86_64 - mm related?
>
>
> Hi, I booted 2.6.14.2 with the MPT fusion performance fix patch about a
> week ago on my file server. The machine crashed last night while it was
> doing backups. You can see the voluminous kernel output below.
>
> Someone else recently had seemingly the same thing happen, but didn't
> think it was a kernel problem. You can read about it here:
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=338335
>
> I will reply later today with the kernel .config, right now I have to
> wait for someone to reboot the machine first.
>
> Any help would be appreciated,
> -ryan
>
> Bad page state at free_hot_cold_page (in process 'taper', page ffff81000260b6f8)
> flags:0x010000000000000c mapping:ffff8100355f1dd8 mapcount:2 count:0
> Backtrace:
>
> Call Trace:<ffffffff80159f93>{bad_page+99} <ffffffff8015a965>{free_hot_cold_page+101}
> <ffffffff80162007>{__page_cache_release+151} <ffffffff802b8fe8>{sgl_unmap_user_pages+120}
> <ffffffff802b48fb>{release_buffering+27} <ffffffff802b4fb1>{st_write+1697}
> <ffffffff8017af46>{vfs_write+198} <ffffffff8017b0a3>{sys_write+83}
> <ffffffff8010db7a>{system_call+126}
> Trying to fix it up, but a reboot is needed
> Bad page state at free_hot_cold_page (in process 'taper', page ffff81000260b6f8)
> flags:0x010000000000081c mapping:ffff81005c0fc310 mapcount:0 count:0
> Backtrace:
>
> Call Trace:<ffffffff80159f93>{bad_page+99} <ffffffff8015a965>{free_hot_cold_page+101}
> <ffffffff80162007>{__page_cache_release+151} <ffffffff802b8fe8>{sgl_unmap_user_pages+120}
> <ffffffff802b48fb>{release_buffering+27} <ffffffff802b4fb1>{st_write+1697}
> <ffffffff8017af46>{vfs_write+198} <ffffffff8017b0a3>{sys_write+83}
> <ffffffff8010db7a>{system_call+126}
> Trying to fix it up, but a reboot is needed
> ----------- [cut here ] --------- [please bite here ] ---------
> Kernel BUG at include/linux/mm.h:341
> invalid operand: 0000 [1] SMP
> CPU 1
> Modules linked in: bonding
> Pid: 2418, comm: taper Tainted: G B 2.6.14.2 #1
> RIP: 0010:[<ffffffff802b8fcd>] <ffffffff802b8fcd>{sgl_unmap_user_pages+93}
> RSP: 0018:ffff810035725e18 EFLAGS: 00010256
> RAX: 0000000000000000 RBX: 0000000000000007 RCX: 000000000000000f
> RDX: 00000000000000e0 RSI: 0000000000000001 RDI: ffff81000260b6f8
> RBP: ffff810004852068 R08: 00000000ffffffff R09: 0000000000000000
> R10: 0000000000008000 R11: 0000000000000200 R12: 0000000000000008
> R13: 0000000000000000 R14: 0000000000008000 R15: ffff810004949d10
> FS: 00002aaaab53d880(0000) GS:ffffffff804db880(0000) knlGS:00000000556b6920
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00002aaaaaac0000 CR3: 0000000035691000 CR4: 00000000000006e0
> Process taper (pid: 2418, threadinfo ffff810035724000, task ffff81017d680300)
> Stack: ffff8101423f3600 ffff810004852000 0000000000000040 0000000000008000
> ffff810004949c00 ffffffff802b48fb ffff810004852000 ffffffff802b4fb1
> ffff810000000000 ffffffff00000001
> Call Trace:<ffffffff802b48fb>{release_buffering+27} <ffffffff802b4fb1>{st_write+1697}
> <ffffffff8017af46>{vfs_write+198} <ffffffff8017b0a3>{sys_write+83}
> <ffffffff8010db7a>{system_call+126}
>
> Code: 0f 0b 68 ba 12 3a 80 c2 55 01 f0 83 47 08 ff 0f 98 c0 84 c0
> RIP <ffffffff802b8fcd>{sgl_unmap_user_pages+93} RSP <ffff810035725e18>
> ----------- [cut here ] --------- [please bite here ] ---------
> Kernel BUG at mm/rmap.c:487
> invalid operand: 0000 [2] SMP
> CPU 1
> Modules linked in: bonding
> Pid: 2418, comm: taper Tainted: G B 2.6.14.2 #1
> RIP: 0010:[<ffffffff8016f3f7>] <ffffffff8016f3f7>{page_remove_rmap+39}
> RSP: 0018:ffff810035725ab0 EFLAGS: 00010286
> RAX: 00000000ffffffff RBX: ffff8100356976f8 RCX: ffff81000000f000
> RDX: 0000000000000000 RSI: 8000000064c69067 RDI: ffff81000260b6f8
> RBP: 00002aaaaaadf000 R08: 0000000000000000 R09: ffff81000260b688
> R10: 00000000fffffffa R11: 0000000000000000 R12: ffff810101c22380
> R13: 8000000064c69067 R14: ffff81000260b6f8 R15: 0000000000000000
> FS: 00002aaaab53d880(0000) GS:ffffffff804db880(0000) knlGS:00000000556b6920
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00002aaaaaac0000 CR3: 0000000035691000 CR4: 00000000000006e0
> Process taper (pid: 2418, threadinfo ffff810035724000, task ffff81017d680300)
> Stack: ffffffff80166ecd 00002aaaaab62000 ffff810035696aa8 00002aaaaab62000
> 00002aaaaab62000 00002aaaaab61fff ffff810035695550 00002aaaaab62000
> ffffffff80167180 ffff810035725d68
> Call Trace:<ffffffff80166ecd>{zap_pte_range+477} <ffffffff80167180>{unmap_page_range+496}
> <ffffffff801672e5>{unmap_vmas+293} <ffffffff8016cfa2>{exit_mmap+162}
> <ffffffff80131ce1>{mmput+49} <ffffffff801371c6>{do_exit+438}
> <ffffffff8010f6f1>{die+81} <ffffffff8010f9df>{do_invalid_op+159}
> <ffffffff802b8fcd>{sgl_unmap_user_pages+93} <ffffffff80381f76>{thread_return+86}
> <ffffffff802a8662>{sym_setup_data_and_start+402} <ffffffff8010e84d>{error_exit+0}
> <ffffffff802b8fcd>{sgl_unmap_user_pages+93} <ffffffff802b8fe8>{sgl_unmap_user_pages+120}
> <ffffffff802b48fb>{release_buffering+27} <ffffffff802b4fb1>{st_write+1697}
> <ffffffff8017af46>{vfs_write+198} <ffffffff8017b0a3>{sys_write+83}
> <ffffffff8010db7a>{system_call+126}
>
[ Rest of the oopses cut ]

I have installed amanda and learned to use it enough to do experiments
with my main system. Unfortunately I have not been able to see any oopses.

My system is somewhat similar to yours but not completely. I have a single
processor system with 1 GB memory whereas your system is a dual processor
system with 5 GB memory. We both use the sym53c8xx driver to control the
tape drive.

I have tried 2.6.14.2 and 2.6.15-rc3 kernels with and without the patch I
sent earlier to the list. The first kernels did not have preemption and
NUMA support enabled but later I configured the 2.6.14.2 kernel with both
enabled. This is the nearest thing to your NUMA dual processor system but
it does not seem to be near enough.

Since I can't reproduce the problem, I have to look at the oopses more
carefully. Both your oopses and those from
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=338335 are quite similar
at the beginning. First come one or more reports about "Bad page state at
free_hot_cold_page". The mapcount is always two and count is zero.
This condition triggers the message.
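
To make the condition concrete, here is a rough user-space sketch in C. It
is a toy model, not the kernel source: the names toy_page and
toy_bad_on_free are made up for illustration, and the rule itself is an
assumption based only on the fields the message prints (a page being freed
should have no remaining users, no remaining mappings and no mapping
pointer).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the fields printed in the report.
 * This is NOT the kernel's struct page. */
struct toy_page {
    int count;           /* "count:" in the report */
    int mapcount;        /* "mapcount:" in the report */
    const void *mapping; /* "mapping:" in the report */
};

/* Assumed rule: a page handed to the free path must have
 * count == 0, mapcount == 0 and no mapping pointer.
 * Returns true when the page would be reported as bad. */
bool toy_bad_on_free(const struct toy_page *p)
{
    assert(p != NULL);
    return p->count != 0 || p->mapcount != 0 || p->mapping != NULL;
}
```

With the values from the first report (count 0, mapcount 2, a non-NULL
mapping), toy_bad_on_free() returns true, while a page with all three
fields cleared passes.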

The next thing is "Kernel BUG at include/linux/mm.h:341". This is in
put_page(struct page *page) and points to the page pointer being NULL.

The third event is "Kernel BUG at mm/rmap.c:487" which results from
"BUG_ON(page_mapcount(page) < 0)". The page pointer has been used
earlier in page_remove_rmap().
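
As a hedged illustration of how a check like "BUG_ON(page_mapcount(page) <
0)" can fire, the following user-space C sketch models a mapping counter
that is dropped once more than it was raised. The names toy_rmap_page and
toy_remove_rmap are hypothetical and the kernel's real accounting is more
involved; the sketch only shows the underflow itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page that tracks only the number of mappings.
 * Hypothetical type, not the kernel's struct page. */
struct toy_rmap_page {
    int mapcount;
};

/* Drop one mapping. Returns false when the counter has gone
 * negative, i.e. where a "mapcount < 0" check would trigger. */
bool toy_remove_rmap(struct toy_rmap_page *p)
{
    assert(p != NULL);
    p->mapcount -= 1;
    return p->mapcount >= 0;
}
```

A page that starts with one mapping survives the first call and fails the
second, which is the kind of imbalance the rmap.c check reports. If the
counter was already reset by an earlier "Trying to fix it up" step, the
very next unmap would be enough to underflow it; whether that is what
happened here is only a guess.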

I am not an mm expert and have no idea what could cause this sequence of
events. Any ideas?

If someone has any ideas for my debugging, they are welcome. I will
continue thinking about this but now I am out of useful ideas.

--
Kai