Re: Debugging Thinkpad T430s occasional suspend failure.

From: Linus Torvalds
Date: Fri Feb 15 2013 - 13:34:39 EST


On Fri, Feb 15, 2013 at 9:44 AM, Paul E. McKenney
<paulmck@xxxxxxxxxxxxxxxxxx> wrote:
>
> This commit was designed to increase the probability of hitting the
> races described in http://lwn.net/Articles/453002/. These races result
> in deadlocks involving the runqueue lock (and perhaps also the priority
> inheritance locks). And yes, I most certainly should have described
> this in the commit message. :-(

Ugh. That particular race seems to be because the softirq handling is
just crazy, and does the "wakeup_softirqd()" form interrupt context,
BUT HAS SPECIFICALLY BROKEN THE IRQ COUNTING!

Because it claims to do it from softirq context, which is pure
garbage. It's not actually in softirq context.

The whole hardirq -> softirq transition seems stupid. I'm sure I made
some serious mistake in cleaning it up, and there's probably some
missed tracepoint (or perhaps screwed-up lockdep annotation), but I
think the hardirq -> softirq preempt thing shoudl be done as an atomic
preempt downgrade, so that we never have a window of "uhhuh, another
interrupt can come in between and see us as being in neither). And the
wakeup_softirqd should be done without playing with preempt count at
all.

Something like this ENTIRELY UNTESTED patch.

Note: I doubt this patch affects Dave's issue at all, I just started
looking at that do_softirq code when I read your bug explanation.

Adding random people for kernel/softirq.c to the participants list.
Comments about the patch? Do note that it's entirely untested, so
consider it more a RFD than a real patch.. It looks like it adds a lot
of lines, but most of it is for comments and simplification of the
logic.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/