Re: [PATCH V6] sched/fair: Remove group imbalance from calculate_imbalance()

From: Jeffrey Hugo
Date: Fri Jul 28 2017 - 14:09:53 EST


On 7/28/2017 7:34 AM, Dietmar Eggemann wrote:
On 28/07/17 13:59, Peter Zijlstra wrote:
On Fri, Jul 28, 2017 at 01:16:24PM +0100, Dietmar Eggemann wrote:
IIRC the topology you had in mind was MC + DIE level with n (n > 2) DIE
level sched groups.

[...]

If I then start a 3rd loop, I see 100% 50%,50%. I then kill the 100%.
Then instantly they balance and I get 2x100% back.

Yeah, could reproduce on IVB-EP (2x10x2).

OK, I have one of those. What should I do, because I didn't actually see
anything odd.

Me neither. Sorry, I was unclear. I meant I could reproduce your
example, that one of the 50% task moves to the idle cpu on this machine.

[...]


Dietmar's assessment is essentially correct. We are seeing a circumstance where fix_small_imbalance() could fix a situation with an idle core and an over-worked core, but it is being skipped. Here is a detailed step through of the specific case we're seeing.

Consider the following system.

DIE [ (all cores) ]

MC [0 1 2 3] [4 5 6 7] [8 9 10 11] ... [60 61 62 63]

Not NUMA, not SMT.

Imagine we have a workload which spawns 5 CPU intensive workers, which we want taskset to 5 cores such that each worker has its own core. We want to see:

DIE [ (all cores) ]

MC [0 1 2 3] [4 5 6 7] [8 9 10 11] ... [60 61 62 63]
* * * * *
T T T T T

Where * indicates the taskset, and T indicates each thread/process.

In many cases, workloads are not ideally assigned at fork(), because load statistics haven't had a chance to be generated by the new work. This results in the following:

DIE [ (all cores) ]

MC [0 1 2 3] [4 5 6 7] [8 9 10 11] ... [60 61 62 63]
* * * * *
T T T T
T

Core 4 is over-assigned (more workers than it can efficiently handle) and core 2 is under-assigned (no work to do). Not great, but we expect the system to load_balance() quickly, and move one T from core 4 to core 2.

We do not see this occur, which means the scheduler is not doing its job by leaving a core idle when there's work to do.

What happens -

Core 5 is idle, so it triggers load_balance(). At the MC level, it identifies core 4 as busiest, and attempts to pull load. Core 5 fails because it is not part of the taskset, and load cannot be moved to another available core (6 or 7), so the group_imbalance flag is set.

Core 2 is idle, so it triggers load_balance(). It won't pull any load from its siblings in MC (0, 1, or 3) because doing so does not solve an imbalance, it just moves the imbalance around. load_balance() at MC fails, so it moves up to DIE. Since the group containing core 4 is group_imbalance, it is selected as busiest.

We go to calculate_imbalance(). Since each T is CPU intensive, their load is high, lets say 200, thus busiest->load_per_task is 200. However, we have a lot of idle or lightly loaded cores at the DIE level, so the sds average (sds->avg_load) gets pushed down significantly; lets say it is 20. Since busiest->group_imbalance is true, we take the min of busiest->load_per_task and sds average (200 and 20), and make that the new busiest->load_per_task.

We calculate env->imbalance, which we'll say is 48.

Now we hit the end check -
/*
* if *imbalance is less than the average load per runnable task
* there is no guarantee that any tasks will be moved so we'll have
* a think about bumping its value to force at least one task to be
* moved
*/
if (env->imbalance < busiest->load_per_task)
return fix_small_imbalance(env, sds);

We try to see if the calculated imbalance can be solved by moving tasks around by comparing the calculated imbalance (env->imbalance) to the calculated average load per task in the busiest group (busiest->load_per_task). Since busiest->load_per_task is an average, either there are tasks with load less than that, or all tasks are at that value, thus if env->imbalance is more than busiest->load_per_task, we have candidate tasks we can evaluate to address the imbalance.

env->imbalance is 48, and the modified busiest->load_per_task is 20, so the check fails, and we exit out of calculate_imbalance() with an imbalance to solve of 48. We jump all the way up to load_balance(), and then into detach_tasks(). We find two tasks that are migration candidates because core 2 is in the same taskset as core 4, so we evaluate if they can help solve the imbalance, or would just move the imbalance around -

if ((load / 2) > env->imbalance)
goto next;

Load of each task is 200, so 200 / 2 = 100, 100 > 48. We can't move anything as doing so would not help solve the imbalance, just move it around, and thus why take the hit of actually migrating the process.

We might not resolve the imbalance, but we get better CPU usage if we move a task from core 4 to core 2, which is what we expect. Based on a comment within the function fix_small_imbalance(), this appears to be the correct path for this circumstance.

In fix_small_imbalance() -

/*
* OK, we don't have enough imbalance to justify moving tasks,
* however we may be able to increase total CPU capacity used by
* moving them.
*/

As mentioned above, we aren't hitting this function path for the following reason:
if (env->imbalance < busiest->load_per_task)
return fix_small_imbalance(env, sds);

env->imbalance is 48, and busiest->load_per_task is 20, but busiest->load_per_task was 200. 48 < 200 is true, so we would hit fix_small_imbalance. The group_imbalance path modifies busiest->load_per_task, so what happens when we remove that path (i.e. this change)? We see the behavior we expect.

This -

DIE [ (all cores) ]

MC [0 1 2 3] [4 5 6 7] [8 9 10 11] ... [60 61 62 63]
* * * * *
T T T T
T

Goes to this -

DIE [ (all cores) ]

MC [0 1 2 3] [4 5 6 7] [8 9 10 11] ... [60 61 62 63]
* * * * *
T T T T T



So, we've identified an issue, and root caused it to a small code snippet. Great for us, but the scheduler is generic code, it needs to work for all. Can this code really be removed?

Here is our current thinking on all uses of this code path. Let us know if we missed anything.

SMT avoids this issue because of SD_PREFER_SIBLING - group overloaded is like group imbalanced, but does not set busiest->group_imbalance so busiest->load_per_task never gets modified.

In cases where the path is actually used:

1- busiest->group_imbalance is not set. The code path is equal to a NOP, therefore it has no positive or negative effect.

2- busiest->group_imbalance is set, but sds->avg_load is greater than busiest->load_per_task. busiest->load_per_task gets assigned the value of busiest->load_per_task, so nothing changes. The code path is equal to a NOP, therefore it has no positive or negative effect.

3- busiest->group_imbalance is set, but sds->avg_load is lesser than busiest->load_per_task. busiest->load_per_task gets assigned the value of sds->avg_load. If the calculated imbalance (env->imbalance) is more than the original busiest->load_per_task, it will be more than sds->avg_load, so we do not hit fix_small_imbalance() as expected. The code path is equal to a NOP, therefore it has no positive or negative effect.

4- busiest->group_imbalance is set, but sds->avg_load is lesser than busiest->load_per_task. busiest->load_per_task gets assigned the value of sds->avg_load. If the calculated imbalance (env->imbalance) is less than the original busiest->load_per_task, but also less than sds->avg_load, we hit fix_small_imbalance(), but would have hit that anyways. The code path is equal to a NOP, therefore it has no positive or negative effect.

5- busiest->group_imbalance is set, but sds->avg_load is lesser than busiest->load_per_task. busiest->load_per_task gets assigned the value of sds->avg_load. If the calculated imbalance (env->imbalance) is less than the original busiest->load_per_task, but more than sds->avg_load, we do not hit fix_small_imbalance(). If we happen to have tasks we can move around based on their load values compared to the current env->imbalance, we can make migrations. The code path is equal to a NOP, therefore it has no positive or negative effect.

6- busiest->group_imbalance is set, but sds->avg_load is lesser than busiest->load_per_task. busiest->load_per_task gets assigned the value of sds->avg_load. If the calculated imbalance (env->imbalance) is less than the original busiest->load_per_task, but more than sds->avg_load, we do not hit fix_small_imbalance(). If we do not have tasks which can migrate based on their load value compared to the current env->imbalance, we hit the issue we've described above. The code path has affected the behavior of the scheduler in a way which is not expected, therefore it has a negative effect.

So, in all of these scenarios but one, the group_imbalance code path of calculate_imbalance() does nothing. In the remaining one scenario, the code path prevents the scheduler from doing its job (assigning work to idle cores), so it harms the system.

Thus we conclude that the code has no benefit, and should be removed.

Considered from a different perspective, busiest->load_per_task, in the current algorithm, represents actual load that can be moved around. Setting it to a global average breaks this invariant and causes a faulty decision at the fix_small_imbalance() path.


In summary, our system is harmed by the code branch, and our analysis indicates the branch has no benefit to the scheduler as it sits today, we feel justified in removing it.

--
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.