Re: [rfc] forked kernel task and mm structures imbalanced on NUMA

From: Peter Zijlstra
Date: Tue Jun 01 2010 - 04:16:55 EST


On Tue, 2010-06-01 at 17:33 +1000, Nick Piggin wrote:
> Another problem I found when testing this patch is that the scheduler
> has some issues of its own when balancing. This is improved by
> traversing the sd groups starting from a different spot each time, so
> processes get sprinkled around the nodes a bit better.

Right, makes sense. And I think we could merge that group iteration
without much problems.

Your alternative placement for sched_exec() seems to make sense too, the
earlier we do that the better the memory allocations will be.

Your changes to sched_fork() and wake_up_new_task() made my head hurt a
bit -- but that's not your fault. I'm not quite sure why you're changing
that though.

The addition of sched_fork_suggest_cpu() to select a target node seems
to make sense, but since you then call fork balancing a second time we
have a chance of ending up on a totally different node all together.

So I think it would make sense to rework the fork balancing muck to be
called only once and stick with its decision.

One thing that would make the whole fork path much easier is fully
ripping out that child_runs_first mess for CONFIG_SMP, I think its been
disabled by default for long enough, and its always been broken in the
face of fork balancing anyway.

So basically we have to move fork balancing back to sched_fork(), I'd
have to again look at wth happens to ->cpus_allowed, but I guess it
should be fixable, and I don't think we should care overly much about
cpu-hotplug.


A specific code comment:

> @@ -2550,14 +2561,16 @@ void wake_up_new_task(struct task_struct
> * We set TASK_WAKING so that select_task_rq() can drop rq->lock
> * without people poking at ->cpus_allowed.
> */
> - cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0);
> - set_task_cpu(p, cpu);
> -
> - p->state = TASK_RUNNING;
> - task_rq_unlock(rq, &flags);
> + if (!cpumask_test_cpu(cpu, &p->cpus_allowed)) {
> + p->state = TASK_WAKING;
> + cpu = select_task_rq(rq, p, SD_BALANCE_FORK, 0);
> + set_task_cpu(p, cpu);
> + p->state = TASK_RUNNING;
> + task_rq_unlock(rq, &flags);
> + rq = task_rq_lock(p, &flags);
> + }
> #endif

That's iffy because p->cpus_allowed isn't stable when we're not holding
the task's current rq->lock or p->state is not TASK_WAKING.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/