Re: frequent lockups in 3.18rc4

From: Ingo Molnar
Date: Sat Dec 13 2014 - 03:20:08 EST



* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Fri, Dec 12, 2014 at 10:54 AM, Dave Jones <davej@xxxxxxxxxx> wrote:
>
> >
> > Something that's still making me wonder if it's some kind of
> > hardware problem is the non-deterministic nature of this bug.
>
> I'd expect it to be a race condition, though. Which can easily
> cause these kinds of issues, and the timing will be pretty
> random even if the load is very regular.
>
> And we know that the scheduler has an integer overflow under
> Sasha's loads, although I didn't hear anything from Ingo and
> friends about it. Ingo/Peter, you were cc'd on that report,
> where at least one of the multiplications in wake_affine() ended
> up overflowing..

Just to make sure, is there any wake_affine report other than
the one in this thread? (I tried a wake_affine full text
search on my inbox and didn't find anything that appeared
relevant.)

> Some scheduler thing that overflows only under heavy load, and
> screws up scheduling could easily account for the RCU thread
> thing. I see it *less* easily accounting for DaveJ's case,
> though, because the watchdog is running at RT priority, and the
> scheduler would have to screw up much more to then not schedule
> an RT task, but..

Yeah, the RT scheduler is harder (but not impossible) to confuse
due to its simplicity, but scheduler counts overflowing could
definitely cause all sorts of trouble and make debugging harder,
so we want to fix it regardless of its likelihood of causing
lockups.
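
For illustration only, here is a minimal sketch of how a multiplication
of load-like values can be checked for overflow instead of wrapping
silently. This is not the wake_affine() code: mul_s64_checked() is a
hypothetical helper, the values in the usage note are made up, and
__builtin_mul_overflow() is assumed to be available (GCC 5+ or Clang).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: multiply two signed 64-bit
 * values and report whether the exact product fits in int64_t.
 *
 * Returns true and stores the product in *out when there is no
 * overflow. Returns false on overflow; *out then holds the wrapped
 * (truncated) result and should not be used.
 */
static bool mul_s64_checked(int64_t a, int64_t b, int64_t *out)
{
	/* The builtin returns true when the product overflowed. */
	return !__builtin_mul_overflow(a, b, out);
}
```

With this, mul_s64_checked(100, 1024, &r) succeeds and leaves 102400
in r, while mul_s64_checked(INT64_MAX / 2, 3, &r) reports failure
rather than handing back a wrapped, possibly negative, value.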

Thanks,

Ingo