Re: [stable] 2.6.32.21 - uptime related crashes?

From: Nikola Ciprich
Date: Fri Apr 29 2011 - 06:04:46 EST


(another CC added)

Hello Willy!

I made some statistics of our servers regarding kernel version and uptime.
Here are some of my thoughts:
- I'm 100% sure this problem wasn't present in kernels <= 2.6.30.x (we've got a lot of boxes with uptimes >600 days)
- I'm 90% sure this problem also wasn't present in 2.6.32.16 (we've got 6 boxes running for 235 to 280 days)

What I'm not sure about is whether this is present in 2.6.32.19; I have
2 boxes running 2.6.32.19 for 238 days and one 2.6.32.20 box for 216 days.
I also have a bunch of 2.6.32.23 boxes, which are now getting close to 200 days of uptime.
But I suspect this really is the first problematic version; more on that later.
First, regarding your question about CONFIG_HZ - we use the 250 HZ setting, which leads me to the following:
250 * 60 * 60 * 24 * 199 = 4298400000, which is a value a little over 2**32! So maybe some unsigned long variable
might overflow? Does this make sense?
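
To double-check the arithmetic, here is a small standalone C program (plain userspace code, nothing kernel specific; all names are mine) that reproduces the numbers above - with HZ=250 a 32-bit tick counter passes 2**32 during day 199:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* illustrative arithmetic only, not kernel code */
	const uint64_t hz = 250;                       /* CONFIG_HZ on the affected boxes */
	const uint64_t ticks_per_day = hz * 60 * 60 * 24;
	const uint64_t wrap_day = UINT32_MAX / ticks_per_day;  /* full days before 2^32 ticks */

	printf("ticks per day      : %llu\n", (unsigned long long)ticks_per_day);
	printf("2^32 reached after : %llu full days\n", (unsigned long long)wrap_day);
	printf("ticks at 199 days  : %llu (2^32 = %llu)\n",
	       (unsigned long long)(ticks_per_day * 199),
	       (unsigned long long)UINT32_MAX + 1);
	return 0;
}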

And as for my suspicion about 2.6.32.19, there is one commit which may be related:

commit 0cf55e1ec08bb5a22e068309e2d8ba1180ab4239
Author: Hidetoshi Seto <seto.hidetoshi@xxxxxxxxxxxxxx>
Date: Wed Dec 2 17:28:07 2009 +0900

sched, cputime: Introduce thread_group_times()

This is a real fix for problem of utime/stime values decreasing
described in the thread:

http://lkml.org/lkml/2009/11/3/522

Now cputime is accounted in the following way:

- {u,s}time in task_struct are increased every time when the thread
is interrupted by a tick (timer interrupt).

- When a thread exits, its {u,s}time are added to signal->{u,s}time,
after adjusted by task_times().

- When all threads in a thread_group exits, accumulated {u,s}time
(and also c{u,s}time) in signal struct are added to c{u,s}time
in signal struct of the group's parent.
.
.
.

I haven't studied this in detail yet, but it seems to me it might really be related. Hidetoshi-san, do you have an opinion on this?
Could this somehow either create or trigger an overflow of some variable which would lead to a division by zero or similar problems?
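
Just to make the suspicion concrete: what I have in mind is a scaling computation of the shape below. This is only a hypothetical sketch with made-up names, not the actual code from thread_group_times()/task_times(), but it shows the two failure modes I'm worried about - a 32-bit intermediate overflowing, and a wrapped counter ending up as a zero divisor, which on x86 raises exactly the kind of divide error we see in the traces:

#include <stdint.h>

/* Hypothetical illustration only, not kernel code. */
uint32_t scale_share(uint32_t value, uint32_t part, uint32_t total)
{
	/*
	 * Two problems once counters approach/pass 2^32:
	 *  - "value * part" overflows the 32-bit intermediate before the division;
	 *  - if "total" is derived from a wrapped counter it can become 0,
	 *    and the division then traps with #DE (divide error).
	 *
	 * A safe variant would widen to 64 bits and guard the divisor:
	 *
	 *	if (!total)
	 *		return value;
	 *	return (uint32_t)((uint64_t)value * part / total);
	 */
	return value * part / total;
}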

Any other thoughts?

best regards

nik





On Thu, Apr 28, 2011 at 08:34:34PM +0200, Willy Tarreau wrote:
> Hello Nikola,
>
> On Thu, Apr 28, 2011 at 10:26:25AM +0200, Nikola Ciprich wrote:
> > Hello everybody,
> >
> > I'm trying to solve a strange issue: today, my fourth machine running 2.6.32.21 just crashed. What makes the cases similar, apart from the same kernel version, is that all boxes had very similar uptimes: 214, 216, 216, and 224 days. This might just be a coincidence, but I think it might be important.
>
> Interestingly, one of our customers just had two machines which crashed
> yesterday after 212 days and 212 days + 20h respectively. They were running
> Debian's 2.6.32-bpo.5-amd64, which is based on 2.6.32.23 AIUI.
>
> The crash looks very similar to the following bug which we have updated :
>
> https://bugzilla.kernel.org/show_bug.cgi?id=16991
>
> (bugzilla doesn't appear to respond as I'm posting this mail).
>
> The top of your output is missing. In our case, as in the reports on the bug
> above, there was a divide by zero error. Did you happen to spot this one
> too, or do you just not know? I observe "divide_error+0x15/0x20" in one
> of your reports, so it's possible that it matches the same pattern at least
> for one trace. Just in case, it would be nice to feed the bugzilla entry
> above.
>
> > Unfortunately I only have backtraces of two crashes (and those are trimmed, sorry), and they do not look as similar as I'd like, but still maybe there is something in common:
> >
> > [<ffffffff81120cc7>] pollwake+0x57/0x60
> > [<ffffffff81046720>] ? default_wake_function+0x0/0x10
> > [<ffffffff8103683a>] __wake_up_common+0x5a/0x90
> > [<ffffffff8103a313>] __wake_up+0x43/0x70
> > [<ffffffffa0321573>] process_masterspan+0x643/0x670 [dahdi]
> > [<ffffffffa0326595>] coretimer_func+0x135/0x1d0 [dahdi]
> > [<ffffffff8105d74d>] run_timer_softirq+0x15d/0x320
> > [<ffffffffa0326460>] ? coretimer_func+0x0/0x1d0 [dahdi]
> > [<ffffffff8105690c>] __do_softirq+0xcc/0x220
> > [<ffffffff8100c40c>] call_softirq+0x1c/0x30
> > [<ffffffff8100e3ba>] do_softirq+0x4a/0x80
> > [<ffffffff810567c7>] irq_exit+0x87/0x90
> > [<ffffffff8100d7b7>] do_IRQ+0x77/0xf0
> > [<ffffffff8100bc53>] ret_from_intr+0x0/0xa
> > <EOI> [<ffffffffa019e556>] ? acpi_idle_enter_bm+0x273/0x2a1 [processor]
> > [<ffffffffa019e54c>] ? acpi_idle_enter_bm+0x269/0x2a1 [processor]
> > [<ffffffff81280095>] ? cpuidle_idle_call+0xa5/0x150
> > [<ffffffff8100a18f>] ? cpu_idle+0x4f/0x90
> > [<ffffffff81323c95>] ? rest_init+0x75/0x80
> > [<ffffffff81582d7f>] ? start_kernel+0x2ef/0x390
> > [<ffffffff81582271>] ? x86_64_start_reservations+0x81/0xc0
> > [<ffffffff81582386>] ? x86_64_start_kernel+0xd6/0x100
> >
> > this box (actually two of the crashed ones) is using the dahdi_dummy module to generate timing for the Asterisk SW PBX, so maybe it's related to that.
> >
> >
> > [<ffffffff810a5063>] handle_IRQ_event+0x63/0x1c0
> > [<ffffffff810a71ae>] handle_edge_irq+0xce/0x160
> > [<ffffffff8100e1bf>] handle_irq+0x1f/0x30
> > [<ffffffff8100d7ae>] do_IRQ+0x6e/0xf0
> > [<ffffffff8100bc53>] ret_from_intr+0x0/0xa
> > <EOI> [<ffffffff8133?f?f>] ? _spin_unlock_irq+0xf/0x40
> > [<ffffffff81337f79>] ? _spin_unlock_irq+0x9/0x40
> > [<ffffffff81064b9a>] ? exit_signals+0x8a/0x130
> > [<ffffffff8105372e>] ? do_exit+0x7e/0x7d0
> > [<ffffffff8100f8a7>] ? oops_end+0xa7/0xb0
> > [<ffffffff8100faa6>] ? die+0x56/0x90
> > [<ffffffff8100c810>] ? do_trap+0x130/0x150
> > [<ffffffff8100ccca>] ? do_divide_error+0x8a/0xa0
> > [<ffffffff8103d227>] ? find_busiest_group+0x3d7/0xa00
> > [<ffffffff8104400b>] ? cpuacct_charge+0x6b/0x90
> > [<ffffffff8100c045>] ? divide_error+0x15/0x20
> > [<ffffffff8103d227>] ? find_busiest_group+0x3d7/0xa00
> > [<ffffffff8103cfff>] ? find_busiest_group+0x1af/0xa00
> > [<ffffffff81335483>] ? thread_return+0x4ce/0x7bb
> > [<ffffffff8133bec5>] ? do_nanosleep+0x75/0x30
> > [<ffffffff810?1?4e>] ? hrtimer_nanosleep+0x9e/0x120
> > [<ffffffff810?08f0>] ? hrtimer_wakeup+0x0/0x30
> > [<ffffffff810?183f>] ? sys_nanosleep+0x6f/0x80
> >
> > The other two don't use it. The only similarity I see here is that both seem to be IRQ-handling related, but otherwise the two issues don't have much in common.
> > Does anybody have an idea on where I should look? Of course I should update all those boxes to (at least) the latest 2.6.32.x, and I'll do that for sure, but still I'd first like to know where the problem was, and whether it has been fixed, or how to fix it...
> > I'd be grateful for any help...
>
> There have been quite a bunch of scheduler updates recently. We may be lucky and
> hope for the bug to have vanished with the changes, but we may just as well see
> the same crash in 7 months :-/
>
> My coworker Hervé (CC'd), who worked on the issue, suggests that we might have
> something which goes wrong past a certain uptime (e.g. 212 days) and which needs
> a special event to be triggered (I/O, process exiting, etc.). I think this
> makes quite a lot of sense.
>
> Could you check your CONFIG_HZ so that we could convert those uptimes to
> jiffies? Maybe this will ring a bell in someone's head :-/
>
> Best regards,
> Willy
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.: +420 596 603 142
fax: +420 596 621 273
mobil: +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@xxxxxxxxxxx
-------------------------------------
