[PROBLEM] Kernel crashes with 2.6.25-rc1 and above

From: Mihai Moldovan
Date: Thu Aug 14 2008 - 13:14:21 EST


Dear Kernel Hackers,

As indicated in the subject line, I have a problem: all kernels from 2.6.25-rc1 onward crash on my notebook after a *random* amount of time, which prevents me from using them.

When I first noticed the problem, I tried to get a usable result by bisecting the kernel, but after two weeks of doing nothing but bisecting, I gave up.

My machine locks up after a random amount of uptime, and this is a real problem. Before bisecting, I thought that this time would be at most 30 minutes (and in fact, newer kernels seem to crash sooner than older ones), but while bisecting I came across the phenomenon that it can just as well take 2 or 4 hours for the box to crash. This means that all my bisecting efforts are worthless, because I may have marked versions as good that were in fact bad (I marked every kernel "good" that still worked after 1 hour of uptime, later I changed that to 2 hours, but I still...)

All in all, the problem is that I cannot really say whether a version is good or bad except after letting the box run for x hours, and x is undefined. It might be safe to let the box run for 24 hours per kernel before marking it good or bad, but given that I would have to test 13 or more kernels, that means about two weeks of doing nothing but testing kernels. I hope you can bear with me: that is really a lot of time.
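
To make the arithmetic above concrete, here is a rough sketch in Python. It assumes that a bisection over N candidate commits needs about ceil(log2(N)) test kernels, which is only an approximation of what git bisect does on a real history. The function names and the commit count of 8000 are hypothetical and purely illustrative; the 13 kernels and the 1, 2 and 24 hour soak times are the figures from this mail.

```python
def bisect_steps(num_commits: int) -> int:
    """Approximate number of kernels to test when bisecting num_commits
    candidate commits, assuming roughly ceil(log2(N)) steps."""
    if num_commits < 1:
        raise ValueError("num_commits must be at least 1")
    # (n - 1).bit_length() equals ceil(log2(n)) for every integer n >= 1.
    return (num_commits - 1).bit_length()


def total_test_days(steps: int, hours_per_kernel: float) -> float:
    """Total wall-clock days spent if every test kernel runs for
    hours_per_kernel hours before it is marked good or bad."""
    return steps * hours_per_kernel / 24.0


if __name__ == "__main__":
    # Hypothetical commit count, chosen only because it yields 13 steps.
    steps = bisect_steps(8000)
    for hours in (1, 2, 24):
        days = total_test_days(steps, hours)
        print(f"{steps} kernels at {hours} h each: {days:.1f} days")
```

Under these assumptions, 13 kernels at 24 hours each come to 13 days of pure run time, not counting build time or any kernel that has to be retested.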

Describing what happens is simple: the machine locks up completely. No input or output works anymore; the kernel responds neither to SysRq key presses nor to ping. Because of this, no panic message is logged, and honestly, I have not seen one in all this time either.

I really am confused about this.

The only messages I could get were "Hangcheck: hangcheck value past margin!", "rtc: lost y interrupts" (y is quite random as well) and this one, when running hwclock:

------------[ cut here ]------------
WARNING: at kernel/lockdep.c:2033 trace_hardirqs_on+0x9b/0x10d()
Modules linked in: irtty_sir sir_dev ipw2200 yenta_socket rsrc_nonstatic pcmcia_core tifm_7xx1 tifm_core sky2
Pid: 2704, comm: hwclock Not tainted 2.6.24-uvesafb-tuxonice-squashFS3.2-04814-gd2e626f #1
[<c01205ec>] warn_on_slowpath+0x41/0x51
[<c010b376>] ? save_stack_address+0x0/0x28
[<c013a2e1>] ? check_usage_forwards+0x19/0x3b
[<c013b726>] ? __lock_acquire+0xac2/0xb0a
[<c03942db>] ? ata_qc_complete+0x115/0x128
[<c0108c60>] ? native_sched_clock+0x8b/0x9f
[<c0138b89>] ? put_lock_stats+0xd/0x21
[<c05362ec>] ? _spin_unlock_irq+0x22/0x42
[<c013a83f>] trace_hardirqs_on+0x9b/0x10d
[<c05362ec>] _spin_unlock_irq+0x22/0x42
[<c0114829>] hpet_rtc_interrupt+0xdf/0x290
[<c01509d8>] handle_IRQ_event+0x1a/0x46
[<c0151832>] handle_edge_irq+0xbe/0xff
[<c0151774>] ? handle_edge_irq+0x0/0xff
[<c0106f09>] do_IRQ+0xab/0xd4
[<c010555a>] common_interrupt+0x2e/0x34
=======================
---[ end trace 3f0a8d3fa0ba549b ]---


I *suspect* that the RTC subsystem _might_ be related to my problem, because all of those warning messages first appeared at some point in 2.6.24, but I cannot say for certain that they are what makes my machine crash.

At this point, I am out of ideas and hope that someone more experienced can help me.

Best regards,


Mihai