Re: [RFC][PATCH 06/22] sched: SCHED_DEADLINE handles special kthreads

From: Peter Zijlstra
Date: Sat Nov 13 2010 - 15:32:26 EST


On Sat, 2010-11-13 at 20:58 +0100, Oleg Nesterov wrote:
> On 11/13, Peter Zijlstra wrote:
> >
> > Something like so?.. hasn't even seen a compiler yet but one's got to do
> > something to keep the worst bore of saturday night telly in check ;-)
>
> Yes, I _think_ this all can work (and imho makes a lot of sense
> if it works).
>
> quick and dirty review below ;)
>
> > struct take_cpu_down_param {
> > - struct task_struct *caller;
> > unsigned long mod;
> > void *hcpu;
> > };
> > @@ -208,11 +207,6 @@ static int __ref take_cpu_down(void *_pa
> >
> > cpu_notify(CPU_DYING | param->mod, param->hcpu);
> >
> > - if (task_cpu(param->caller) == cpu)
> > - move_task_off_dead_cpu(cpu, param->caller);
> > - /* Force idle task to run as soon as we yield: it should
> > - immediately notice cpu is offline and die quickly. */
> > - sched_idle_next();
>
> Yes. but we should remove "while (!idle_cpu(cpu))" from _cpu_down().

Right, I think we should replace that with something like
BUG_ON(!idle_cpu(cpu)); since we migrated everything away during stop
machine, the cpu should be idle afterwards.

> > @@ -2381,18 +2381,15 @@ static int select_fallback_rq(int cpu, s
> > return dest_cpu;
> >
> > /* No more Mr. Nice Guy. */
> > - if (unlikely(dest_cpu >= nr_cpu_ids)) {
> > - dest_cpu = cpuset_cpus_allowed_fallback(p);
> > - /*
> > - * Don't tell them about moving exiting tasks or
> > - * kernel threads (both mm NULL), since they never
> > - * leave kernel.
> > - */
> > - if (p->mm && printk_ratelimit()) {
> > - printk(KERN_INFO "process %d (%s) no "
> > - "longer affine to cpu%d\n",
> > - task_pid_nr(p), p->comm, cpu);
> > - }
> > + dest_cpu = cpuset_cpus_allowed_fallback(p);
> > + /*
> > + * Don't tell them about moving exiting tasks or
> > + * kernel threads (both mm NULL), since they never
> > + * leave kernel.
> > + */
> > + if (p->mm && printk_ratelimit()) {
> > + printk(KERN_INFO "process %d (%s) no longer affine to cpu%d\n",
> > + task_pid_nr(p), p->comm, cpu);
> > }
>
> Hmm. I was really puzzled until I realized this is just cleanup,
> we can't reach this point if dest_cpu < nr_cpu_ids.

Right.. noticed that when I read the code, and thought I might as well
fix it up.

> > +static void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p)
> > {
> > struct rq *rq = cpu_rq(dead_cpu);
> > + int needs_cpu, uninitialized_var(dest_cpu);
> >
> > - /* Must be exiting, otherwise would be on tasklist. */
> > - BUG_ON(!p->exit_state);
> > -
> > - /* Cannot have done final schedule yet: would have vanished. */
> > - BUG_ON(p->state == TASK_DEAD);
> > -
> > - get_task_struct(p);
> > + needs_cpu = (task_cpu(p) == dead_cpu) && (p->state != TASK_WAKING);
> > + if (needs_cpu)
> > + dest_cpu = select_fallback_rq(dead_cpu, p);
> > + raw_spin_unlock(&rq->lock);
>
> Probably we do not need any checks. This task was picked by
> ->pick_next_task(), it should have task_cpu(p) == dead_cpu ?

Right, we can drop those checks; it's unconditionally true.

> But. I think there is a problem. We should not migrate current task,
> stop thread, which does the migrating. At least, sched_stoptask.c
> doesn't implement ->enqueue_task() and we can never wake it up later
> for kthread_stop().

Hrm, right; so while the migration thread isn't actually on any rq
structure as such, pick_next_task() will still return it.. we need to
come up with a way to skip it.

As to current, take_cpu_down() is actually migrating current away before
this patch, so I simply included current in the
CPU_DYING->migrate_tasks() loop and removed the special case from
take_cpu_down().

> Perhaps migrate_tasks() should do for_each_class() by hand to
> ignore stop_sched_class. But then _cpu_down() should somehow
> ensure the stop thread on the dead CPU is already parked in
> schedule().

Well, since we're in stop_machine, all cpus but the one executing are
stuck in the stop_machine_cpu_stop() loop. In both cases we could
simply fudge the pick_next_task_stop() condition (e.g. set rq->stop =
NULL) while doing that loop and restore it afterwards; nothing will
hit schedule() while we're there.

> > - case CPU_DYING_FROZEN:
> > /* Update our root-domain */
> > raw_spin_lock_irqsave(&rq->lock, flags);
> > if (rq->rd) {
> > BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
> > set_rq_offline(rq);
> > }
> > + migrate_tasks(cpu);
> > + BUG_ON(rq->nr_running != 0);
> > raw_spin_unlock_irqrestore(&rq->lock, flags);
>
> Probably we don't really need rq->lock. All cpus run stop threads.

Right, but I was worried about stuff that relied on lockdep state like
the rcu lockdep stuff.. and taking the lock doesn't hurt.

> I am not sure about rq->idle, perhaps it should be deactivated.
> I don't think we should migrate it.

Ah, I think the !nr_running check will bail before we end up selecting
the idle thread.

> What I never understood is the meaning of play_dead/etc. If we
> remove sched_idle_next(), who will do that logic? And how the
> idle thread can call idle_task_exit() ?

Well, since we'll have migrated every task on that runqueue (except the
migration thread), the only runnable task left (once the migration
thread stops running) is the idle thread, so it should be implicit.

As to play_dead():

cpu_idle()
  if (cpu_is_offline(smp_processor_id()))
    play_dead()
      native_play_dead() /* did I already say I detest paravirt? */
        play_dead_common()
          idle_task_exit();
          local_irq_disable();
        tboot/mwait/hlt

It basically puts the cpu to sleep with IRQs disabled, needs special
magic to wake it back up.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/