Re: [patch v2 1/2] sched: check for prev_cpu == this_cpu beforecalling wake_affine()

From: Mike Galbraith
Date: Thu Apr 01 2010 - 01:32:36 EST


On Wed, 2010-03-31 at 16:47 -0700, Suresh Siddha wrote:

> Issues in the current select_idle_sibling() logic in select_task_rq_fair()
> in the context of a task wake-up:
>
> a) Once we select the idle sibling, we use that domain (spanning the cpu that
> the task is currently woken-up and the idle sibling that we found) in our
> wake_affine() decisions. This domain is completely different from the
> domain(we are supposed to use) that spans the cpu that the task currently
> woken-up and the cpu where the task previously ran.

Why does that matter? If we find an idle shared cache cpu before we hit
the spanning domain, we don't use affine_sd other than maybe (unlikely)
for updating group scheduler shares.

> b) We do select_idle_sibling() check only for the cpu that the task is
> currently woken-up on. If select_task_rq_fair() selects the previously run
> cpu for waking the task, doing a select_idle_sibling() check
> for that cpu also helps and we don't do this currently.

True, but that costs too. Those idle checks aren't cheap.

> c) In the scenarios where the cpu that the task is woken-up is busy but
> with its HT siblings are idle, we are selecting the task be woken-up
> on the idle HT sibling instead of a core that it previously ran
> and currently completely idle. i.e., we are not taking decisions based on
> wake_affine() but directly selecting an idle sibling that can cause
> an imbalance at the SMT/MC level which will be later corrected by the
> periodic load balancer.

Yes, the pressing decision for this one wakeup is can we wake to a
shared cache and thus avoid cache misses.

IMHO, the point of the affinity decision isn't instant perfect balance,
it's cache affinity if at all possible without wrecking balance. Load
balancing moves tasks for optimal CPU utilization, tasks waking each
other pull to a shared domain.. a tug-of-war that balances buddies over
time. wake_affine()'s job is only to say "no, leave it where it was for
now". I don't see any reason to ask wake_affine()'s opinion about an
idle CPU. We paid for idle shared cache knowledge.

We certainly wouldn't want to leave the wakee on it's previous CPU only
because that CPU is idle, it would have to be idle and sharing cache.

That said, Nehalem may ramp better with select_idle_sibling() turned off
at the HT level, and ramp was it's motivation. Maybe you could continue
checking until out of shared cache country, but that's more expensive.

The logic may not be perfect, but it really needs to become cheaper, not
more expensive.

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/