Re: [PATCH] sched: Fix lost reschedule in __cond_resched()

From: Frederic Weisbecker
Date: Sun Dec 14 2014 - 13:04:28 EST


On Sat, Dec 13, 2014 at 08:36:34AM +0100, Ingo Molnar wrote:
>
> * Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> > I'm also not sure if the bug ever happens with preemption
> > disabled. Sasha, was that you who reported that you cannot
> > reproduce it without preemption? It strikes me that there's a
> > race condition in __cond_resched() wrt preemption, for example:
> > we do
> >
> >     __preempt_count_add(PREEMPT_ACTIVE);
> >     __schedule();
> >     __preempt_count_sub(PREEMPT_ACTIVE);
> >
> > and in between the __schedule() and __preempt_count_sub(), if
> > an interrupt comes in and wakes up some important process, it
> > won't reschedule (because preemption is active), but then we
> > enable preemption again and don't check whether we should
> > reschedule (again), and we just go on our merry ways.
>
> Indeed, that's a really good find regardless of whether it's the
> source of these lockups - the (untested) patch below ought to
> cure that.
>
> > Now, I don't see how that could really matter for a long time -
> > returning to user space will check need_resched, and sleeping
> > will obviously force a reschedule anyway, so these kinds of
> > races should at most delay things by just a tiny amount, but
> > maybe there is some case where we screw up in a bigger way. So
> > I do *not* believe that the one in __cond_resched() matters,
> > but I'm giving it as an example of the kind of things that
> > could go wrong.
>
> (as you later note) NOHZ is somewhat special in this regard,
> because there we try really hard not to run anything
> periodically, so a lost reschedule will matter more.
>
> But ... I'd be surprised if this patch made a difference: it
> should normally not be possible to go idle with tasks on the
> runqueue (even with this bug present), and with at least one busy
> task on the CPU we get the regular scheduler tick which ought to
> hide such latencies.
>
> It's nevertheless a good thing to fix, I'm just not sure it's the
> root cause of the observed lockup here.
>
> Thanks,
>
> Ingo
>
> --
>
> Reported-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bb398c0c5f08..532809aa0544 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4207,6 +4207,8 @@ static void __cond_resched(void)
>  	__preempt_count_add(PREEMPT_ACTIVE);
>  	__schedule();
>  	__preempt_count_sub(PREEMPT_ACTIVE);
> +	if (need_resched())
> +		__schedule();
>  }

Nice catch! This indeed matters a lot for full nohz, where a lost
reschedule interrupt might be ignored and never fixed up by a nearby
tick. And even when a tick does fix it up, a reschedule delayed by up
to a full tick period (1/HZ) is a latency issue in itself.
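
To spell out the window (a rough annotated sketch of the sequence
quoted above; the comments are mine):

	__preempt_count_add(PREEMPT_ACTIVE);
	__schedule();
	/*
	 * An IRQ here may wake a task and set TIF_NEED_RESCHED, but
	 * the IRQ return path won't reschedule: preempt_count() is
	 * non-zero because of PREEMPT_ACTIVE.
	 */
	__preempt_count_sub(PREEMPT_ACTIVE);
	/*
	 * Preemption is possible again, but nothing on this path
	 * rechecks need_resched(), so the wakeup stays pending until
	 * the next natural reschedule point.
	 */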

Anyway, the above __schedule() should probably remain a preemption
point, i.e. keep running with PREEMPT_ACTIVE set, to make sure that a
task which has set itself TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE
is handled as expected and doesn't get deactivated early.
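
For reference, this is roughly the check in __schedule() that
PREEMPT_ACTIVE short-circuits (a simplified sketch, not the exact
source):

	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
		if (unlikely(signal_pending_state(prev->state, prev))) {
			prev->state = TASK_RUNNING;
		} else {
			/*
			 * Without PREEMPT_ACTIVE, a task that has set
			 * itself TASK_[UN]INTERRUPTIBLE is dequeued
			 * here even though it was rescheduled
			 * involuntarily and may not be done running.
			 */
			deactivate_task(rq, prev, DEQUEUE_SLEEP);
		}
	}

With PREEMPT_ACTIVE set, the deactivation is skipped and the task
stays on the runqueue across the involuntary reschedule.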

Such as:

Signed-off-by: Frederic Weisbecker <fweisbec@xxxxxxxxx>
---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 240157c..6e942f3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2922,6 +2922,21 @@ void __sched schedule_preempt_disabled(void)
 	preempt_disable();
 }
 
+static void __preempt_schedule(void)
+{
+	do {
+		__preempt_count_add(PREEMPT_ACTIVE);
+		__schedule();
+		__preempt_count_sub(PREEMPT_ACTIVE);
+
+		/*
+		 * Check again in case we missed a preemption opportunity
+		 * between schedule and now.
+		 */
+		barrier();
+	} while (need_resched());
+}
+
 #ifdef CONFIG_PREEMPT
 /*
  * this is the entry point to schedule() from in-kernel preemption
@@ -2937,17 +2952,7 @@ asmlinkage __visible void __sched notrace preempt_schedule(void)
 	if (likely(!preemptible()))
 		return;
 
-	do {
-		__preempt_count_add(PREEMPT_ACTIVE);
-		__schedule();
-		__preempt_count_sub(PREEMPT_ACTIVE);
-
-		/*
-		 * Check again in case we missed a preemption opportunity
-		 * between schedule and now.
-		 */
-		barrier();
-	} while (need_resched());
+	__preempt_schedule();
 }
 NOKPROBE_SYMBOL(preempt_schedule);
 EXPORT_SYMBOL(preempt_schedule);
@@ -4249,9 +4254,7 @@ SYSCALL_DEFINE0(sched_yield)
 
 static void __cond_resched(void)
 {
-	__preempt_count_add(PREEMPT_ACTIVE);
-	__schedule();
-	__preempt_count_sub(PREEMPT_ACTIVE);
+	__preempt_schedule();
 }
 
 int __sched _cond_resched(void)
--