Re: [RFC PATCH v3 0/3] sched: simplify the select_task_rq_fair()

From: Michael Wang
Date: Thu Feb 21 2013 - 04:09:04 EST

On 02/21/2013 04:10 PM, Mike Galbraith wrote:
> On Thu, 2013-02-21 at 15:00 +0800, Michael Wang wrote:
>> On 02/21/2013 02:11 PM, Mike Galbraith wrote:
>>> On Thu, 2013-02-21 at 12:51 +0800, Michael Wang wrote:
>>>> On 02/20/2013 06:49 PM, Ingo Molnar wrote:
>>>> [snip]
>> [snip]
>>>> if wake_affine()
>>>> new_cpu = select_idle_sibling(curr_cpu)
>>>> else
>>>> new_cpu = select_idle_sibling(prev_cpu)
>>>> return new_cpu
>>>> Actually that doesn't make sense.
>>>> I think wake_affine() is trying to check whether move a task from
>>>> prev_cpu to curr_cpu will break the balance in affine_sd or not, but why
>>>> won't break balance means curr_cpu is better than prev_cpu for searching
>>>> the idle cpu?
>>> You could argue that it's impossible to break balance by moving any task
>>> to any idle cpu, but that would mean bouncing tasks cross node on every
>>> wakeup is fine, which it isn't.
>> I don't get it... could you please give me more detail on how
>> wake_affine() related with bouncing?
> If we didn't ever ask if it's ok, we'd always pull, and stack load up on
> one package if there's the tiniest of holes to stuff a task into,
> periodic balance forcibly rips buddies back apart, repeat. At least
> with wake_affine() in the loop, there's somewhat of a damper.

I think I got your point, a question about the possibility to locate an
idle cpu which belong to different package from prev_cpu, and how
wake_affine() help on it.

Old logical require prev_cpu and curr_cpu to be affine, which means they
share caches on some level, usually means they are in the same package,
and the select_idle_sibling() start search from the top cache share
level, usually means search all the smp in one package, so, usually,
both of the cases will only locate idle cpu in their package.

I could not figure out what's the case that the wake_affine() could
benefit a lot, but it do harm a lot according to the testing results.

>>>> So the new logical in this patch set is:
>>>> new_cpu = select_idle_sibling(prev_cpu)
>>>> if idle_cpu(new_cpu)
>>>> return new_cpu
>>> So you tilted the scales in favor of leaving tasks in their current
>>> package, which should benefit large footprint tasks, but should also
>>> penalize light communicating tasks.
>> Yes, I'd prefer to wakeup the task on a cpu which:
>> 1. idle
>> 2. close to prev_cpu
>> So if both curr_cpu and prev_cpu have idle cpu in their topology, which
>> one is better? that depends on how task benefit from cache and the
>> balance situation, whatever, I don't think the benefit worth the high
>> cost of wake_affine() in most cases...
> We've always used wake_affine() before, yet been able to schedule at
> high frequency, so I don't see that it can be _that_ expensive. I
> haven't actually measured lately (loooong time) though.
> WRT cost/benefit of migration, yeah, it depends entirely on the tasks,
> some will gain, some will lose. On a modern single processor box, it
> just doesn't matter, there's only one llc (two s_i_s() calls = oopsie),
> but on my beloved old Q6600 or a big box, it'll matter a lot to
> something. NUMA balance will deal with big boxen, my trusty old Q6600
> will likely get all upset with some localhost stuff.

But is this patch set really cause regression on your Q6600? It may
sacrificed some thing, but I still think it will benefit far more,
especially on huge systems.

Michael Wang

> -Mike
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at
> Please read the FAQ at

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at