Re: frequent lockups in 3.18rc4

From: Thomas Gleixner
Date: Fri Dec 19 2014 - 20:06:45 EST


On Fri, 19 Dec 2014, Chris Mason wrote:
> On Fri, Dec 19, 2014 at 6:22 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> > But at the very end this would be detected by the runtime check of the
> > hrtimer interrupt, which does not trigger. And it would trigger at
> > some point as ALL cpus including CPU0 in that trace dump make
> > progress.
>
> I'll admit that at some point we should be hitting one of the WARN or BUG_ON,
> but it's possible to thread that needle and corrupt the timer list, without
> hitting a warning (CPU 1 in my example has to enqueue last). Once the rbtree
> is hosed, it can go forever. Probably not the bug we're looking for, but
> still suspect in general.

I will certainly take a close look at that, but in this case we get
out of that state later on, and I doubt that we have both

A) a corruption of the rbtree
B) a self healing of the rbtree afterwards

I doubt it, but who knows.

Though even if both A and B happened, we would still get the 'hrtimer
interrupt took a gazillion of seconds' warning, because CPU0 definitely
leaves the timer interrupt at some point; otherwise we would not see
backtraces from usb, userspace and idle later on.
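
To illustrate the kind of runtime check meant above, here is a minimal
user-space sketch, not the kernel code. The function name
check_interrupt_duration() and the HANG_THRESHOLD_NS value are
hypothetical and chosen only for illustration; the kernel's actual logic
in the hrtimer interrupt path is more involved. The point is just that
the elapsed time is evaluated when the handler exits, so a handler that
eventually returns after a very long time would still trip the warning.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical threshold for illustration; not a value from the kernel. */
#define HANG_THRESHOLD_NS 1000000ULL

/*
 * Compare the time spent in a (simulated) timer interrupt handler
 * against the threshold. Returns 1 and prints a warning if the handler
 * ran too long, 0 otherwise. Assumes exit_ns >= entry_ns.
 */
static int check_interrupt_duration(uint64_t entry_ns, uint64_t exit_ns)
{
    uint64_t delta = exit_ns - entry_ns;

    if (delta > HANG_THRESHOLD_NS) {
        fprintf(stderr, "timer interrupt took %llu ns\n",
                (unsigned long long)delta);
        return 1;
    }
    return 0;
}
```

With these assumed numbers, a handler that ran for 500 ns passes
silently, while one that ran for 5 ms produces the warning as soon as it
returns.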

Thanks,

tglx



