Re: [patch] sched: fix b5d9d734 blunder in task_new_fair()

From: Mike Galbraith
Date: Fri Nov 27 2009 - 03:57:53 EST

Next message: Stefani Seibold: "[PATCH] fix PPC floating point debug"
Previous message: Russell King - ARM Linux: "Re: [patch] spitz: better voltage-to-percentage conversion"
In reply to: Mike Galbraith: "Re: [patch] sched: fix b5d9d734 blunder in task_new_fair()"
Next in thread: Mike Galbraith: "Re: [patch] sched: fix b5d9d734 blunder in task_new_fair()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, 2009-11-27 at 09:45 +0100, Mike Galbraith wrote:
> Off list

Guess not, mouse went click on the way to save menu. Oh well, was going
to spare thousands of mailboxes another confused /me message, but too
late now.

> On Thu, 2009-11-26 at 17:26 +0100, Peter Zijlstra wrote:
> > On Sun, 2009-11-22 at 13:07 +0100, Mike Galbraith wrote:
> > > @@ -2589,16 +2588,10 @@ void sched_fork(struct task_struct *p, i
> > > */
> > > p->prio = current->normal_prio;
> > >
> > > - if (!rt_prio(p->prio))
> > > + if (!task_has_rt_policy(p))
> > > p->sched_class = &fair_sched_class;
> > >
> > > -#ifdef CONFIG_SMP
> > > - cpu = p->sched_class->select_task_rq(p, SD_BALANCE_FORK, 0);
> > > -#endif
> > > - local_irq_save(flags);
> > > - update_rq_clock(cpu_rq(cpu));
> > > - set_task_cpu(p, cpu);
> > > - local_irq_restore(flags);
> > > + __set_task_cpu(p, cpu);
> >
> > OK, so I figured out why it was in sched_fork() and not in
> > wake_up_new_task().
> >
> > It is because in sched_fork() the new task isn't in the tasklist yet and
> > can therefore not race with any other migration logic.
>
> All the raciness I'm fretting over probably just doesn't matter much.
> Things aren't exploding. Maybe min_vruntime is the only thing I should
> be worrying about. That concern is in-flight deltas of SCHED_IDLE
> magnitude, ie cross cpu "fuzziness" on a very large scale.
>
> However :-/ (aw sh*t, here i go again. aaaaOOOOOgah! dive! dive!;)
>
> WRT affinity, sched_class, nice level fretting, that can all change from
> userland at any instant that you do not hold the task's runqueue lock
> and the tasklist lock is not held by somebody to keep them from getting
> a task reference to start the ball rolling. As soon as you drop the
> runqueue lock, userland can acquire, and change whatever it likes under
> you, so afaikt, we can call the wrong sched_class method etc etc.
>
> 3f029d3 agrees fully wrt sched_class at least:
> In addition, we fix a race condition where we try to access
> current->sched_class without holding the rq->lock. This is
> technically racy, as the sched-class could change out from
> under us. Instead, we reference the per-rq post_schedule
> variable with the runqueue unlocked, but with preemption
> disabled to see if we need to reacquire the rq->lock.
>
> The only thing that really changes with the unlocked _rummaging_ is that
> we now can't count on nr_running/load on the task's current runqueue,
> sched_class etc while you're rummaging, ALL state is fuzzy, instead of
> only most.
>
> However, I don't think we can even count on the task remaining untouched
> while in TASK_WAKING state, and that might be a bigger deal.
>
> afaikt, userland can migrate the task you're in the process of waking
> while you're off rummaging around looking for a place to put it, like
> so: Wakee is on the tasklist, can be accessed by userland. We wouldn't
> be in ttwu either were it not. We're waking, we set task state to
> TASK_WAKING, release the lock, userland acquires, nobody but ttwu has
> ever heard of a TASK_WAKING, so it happily changes task's affinity,
> migrates the sleeping task to the one and only (pins) correct runqueue,
> sets task cpu etc, releases the lock, and goes home. We finish
> rummaging, do NOT check for an intervening affinity change, instead, we
> do an unlocked scribble over what userland just wrote, resetting cpu and
> vruntime to a now illegal cpu, and activate. I'm not seeing any
> inhibitor for this scenario.
>
> When I moved fork balancing runqueue selection to the much more logical
> wakeup and enqueue time, vs copy and fiddle time, I didn't fix anything,
> I merely duplicated the races that are now in ttwu.
>
> No matter were we do the selection, we can race with userland if the
> darn thing isn't locked all the while. With .31 ttwu locking, there is
> no race, because nobody can get to the task struct. If target cpu
> changes via rq selection, we set cpu, _then_ unlock, at which point
> userland or whomever may override _our_ decision, but we never write
> after re-acquiring, so intervening changes, if any, stay intact.
>
> With an exec, userland policy/affinity change will deactivate/activate
> or do a migration call. We don't have the thing locked while we're
> rummaging, so what keeps sched_class from changing after we evaluated,
> so we call the wrong method, and then do our own migration call?
>
> /me is still pretty befuddled, and haven't even crawled over PI.
>
> I flat don't see how we can do this race free, unless we put every task
> in some untouchable state while we're rummaging, and teach everything
> having to do with migration about that state.
>
> -Mike
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Stefani Seibold: "[PATCH] fix PPC floating point debug"
Previous message: Russell King - ARM Linux: "Re: [patch] spitz: better voltage-to-percentage conversion"
In reply to: Mike Galbraith: "Re: [patch] sched: fix b5d9d734 blunder in task_new_fair()"
Next in thread: Mike Galbraith: "Re: [patch] sched: fix b5d9d734 blunder in task_new_fair()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]