Re: [PATCH 4/4] sched/topology: the group balance cpu must be a cpu where the group is installed

From: Lauro Venancio
Date: Tue Apr 25 2017 - 10:34:18 EST


On 04/25/2017 09:17 AM, Peter Zijlstra wrote:
> On Mon, Apr 24, 2017 at 12:11:59PM -0300, Lauro Venancio wrote:
>> On 04/24/2017 10:03 AM, Peter Zijlstra wrote:
>>> On Thu, Apr 20, 2017 at 04:51:43PM -0300, Lauro Ramos Venancio wrote:
>>>
>>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>>> index e77c93a..694e799 100644
>>>> --- a/kernel/sched/topology.c
>>>> +++ b/kernel/sched/topology.c
>>>> @@ -505,7 +507,11 @@ static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
>>>>
>>>>  	for_each_cpu(i, sg_span) {
>>>>  		sibling = *per_cpu_ptr(sdd->sd, i);
>>>> -		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
>>>> +		if (!cpumask_equal(sg_span, sched_group_cpus(sibling->groups)))
>>>>  			continue;
> Hmm _this_ is what requires us to move the thing to a whole separate
> iteration. Because when we build the groups, the domains are already
> constructed, so that was right.
>
> So the moving crud around wasn't the primary fix, this is.
>
> With the fact that sched_group_cpus(sibling->groups) ==
> sched_domain_span(sibling->child) (if the child exists) established in the
> previous patches, could we not simplify this like the below?
We can. We just need to handle the case where there is no child,
otherwise we end up with empty masks.
We have to replicate the build_group_from_child_sched_domain() behavior:

	if (sd->child)
		cpumask_copy(sg_span, sched_domain_span(sd->child));
	else
		cpumask_copy(sg_span, sched_domain_span(sd));


So we need something like:

	if (sibling->child)
		gsd = sibling->child;
	else
		gsd = sibling;

	if (!cpumask_equal(sg_span, sched_domain_span(gsd)))
		continue;

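In context (an untested sketch), the whole loop in build_group_mask()
would then read roughly:

	struct sched_domain *sibling, *gsd;
	int i;

	for_each_cpu(i, sg_span) {
		sibling = *per_cpu_ptr(sdd->sd, i);

		/* groups are built from the child's span when a child exists */
		if (sibling->child)
			gsd = sibling->child;
		else
			gsd = sibling;

		/* only CPUs whose group matches this span can balance here */
		if (!cpumask_equal(sg_span, sched_domain_span(gsd)))
			continue;

		cpumask_set_cpu(i, sched_group_mask(sg));
	}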

>
> ---
> Subject: sched/topology: Fix overlapping sched_group_mask
> From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Date: Tue Apr 25 14:00:49 CEST 2017
>
> The point of sched_group_mask is to select those CPUs from
> sched_group_cpus that can actually arrive at this balance domain.
>
> The current code gets it wrong, as can be readily demonstrated with a
> topology like:
>
> node   0   1   2   3
>   0:  10  20  30  20
>   1:  20  10  20  30
>   2:  30  20  10  20
>   3:  20  30  20  10
>
> Where (for example) domain 1 on CPU1 ends up with a mask that includes
> CPU0:
>
> [] CPU1 attaching sched-domain:
> []  domain 0: span 0-2 level NUMA
> []   groups: 1 (mask: 1), 2, 0
> []   domain 1: span 0-3 level NUMA
> []    groups: 0-2 (mask: 0-2) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)
>
> This causes group_balance_cpu() to compute the wrong CPU and
> consequently should_we_balance() will terminate early resulting in
> missed load-balance opportunities.
>
> The fixed topology looks like:
>
> [] CPU1 attaching sched-domain:
> []  domain 0: span 0-2 level NUMA
> []   groups: 1 (mask: 1), 2, 0
> []   domain 1: span 0-3 level NUMA
> []    groups: 0-2 (mask: 1) (cpu_capacity: 3072), 0,2-3 (cpu_capacity: 3072)
>
> Debugged-by: Lauro Ramos Venancio <lvenanci@xxxxxxxxxx>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> ---
> kernel/sched/topology.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -495,6 +495,9 @@ enum s_alloc {
>  /*
>   * Build an iteration mask that can exclude certain CPUs from the upwards
>   * domain traversal.
> + *
> + * Only CPUs that can arrive at this group should be considered to continue
> + * balancing.
>   */
>  static void build_group_mask(struct sched_domain *sd, struct sched_group *sg)
>  {
> @@ -505,7 +508,13 @@ static void build_group_mask(struct sche
>
>  	for_each_cpu(i, sg_span) {
>  		sibling = *per_cpu_ptr(sdd->sd, i);
> -		if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
> +
> +		/* overlap should have children; except for FORCE_SD_OVERLAP */
> +		if (WARN_ON_ONCE(!sibling->child))
> +			continue;
> +
> +		/* If we would not end up here, we can't continue from here */
> +		if (!cpumask_equal(sg_span, sched_domain_span(sibling->child)))
>  			continue;
>
>  		cpumask_set_cpu(i, sched_group_mask(sg));
>
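
For reference, the mask being fixed here is what group_balance_cpu()
consumes; in kernels of this era it reads roughly:

	/*
	 * Return the canonical balance CPU of a group: the first CPU of
	 * the group's span that is also set in its iteration mask.
	 */
	int group_balance_cpu(struct sched_group *sg)
	{
		return cpumask_first_and(sched_group_cpus(sg),
					 sched_group_mask(sg));
	}

should_we_balance() then bails out on every CPU except the one this
returns, so a mask that names an unreachable CPU silently costs
load-balance attempts.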