Re: [PATCH v3 3/3] sched, x86: Check that we're on the right stack in schedule and __might_sleep

From: Andy Lutomirski
Date: Mon Feb 29 2016 - 00:28:45 EST


On Wed, Nov 19, 2014 at 11:44 AM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Wed, Nov 19, 2014 at 11:29 AM, Andi Kleen <andi@xxxxxxxxxxxxxx> wrote:
>>
>> The exception handlers which use the IST stacks don't necessarily
>> set irq count. Maybe they should.
>
> Hmm. I think they should. Since they clearly must not schedule, as
> they use a percpu stack.
>
> Which exceptions use IST?
>
> [ grep grep ]
>
> Looks like stack, doublefault, nmi, debug and mce. And yes, I really
> think they should all raise the irq count if they don't already.
> Rather than add random arch-specific "let's check that we're on the
> right stack" code to the might-sleep stuff, just use the one we have.
>

Resurrecting an old thread:

The outcome of this discussion was that ist_enter now raises
HARDIRQ_COUNT. I think this is causing a problem. If a user program
enables TF, it generates a bunch of debug exceptions. The handlers
raise the IRQ count and do stuff, and apparently some of that stuff
can raise a softirq. (I have no idea where the softirq is being
raised.) The softirq code notices that we're in_interrupt and doesn't
wake ksoftirqd because it thinks we're about to exit the interrupt and
process the softirq. But we don't, which causes occasional warnings
and confuses things (and me!).

So how do we fix it? If we stop raising HARDIRQ_COUNT (and apply
$SUBJECT?), then raise_softirq will wake ksoftirqd and life is good.
But this seems a bit silly, since, if we entered the ist exception
handler from a context with irqs on and softirqs enabled, we *could*
plausibly handle the softirq right away -- we're on an essentially
empty stack. (Of course, it's a *small* stack, since it could be the
IST stack.)

Or we could just let ksoftirqd do its thing and stop raising
HARDIRQ_COUNT. We could add a new preempt count field just for IST
(yuck). We could try to hijack a different preempt count field
(NMI?). But I kind of like the idea of just reinstating the original
patch of explicitly checking that we're on a safe stack in schedule
and __might_sleep, since that is the actual condition we care about.

--Andy