Re: [RFC PATCH 08/23] sched/fair: Compute task-class performance scores for load balancing
From: Peter Zijlstra
Date: Wed Oct 26 2022 - 04:55:44 EST
On Tue, Oct 25, 2022 at 08:57:24PM -0700, Ricardo Neri wrote:
> Do you want me to add your Signed-off-by and Co-developed-by tags?
Nah; who cares ;-)
> > @@ -8749,32 +8747,18 @@ static void compute_ilb_sg_task_class_sc
> > if (!busy_cpus)
> > return;
> >
> > - score_on_dst_cpu = arch_get_task_class_score(class_sgs->p_min_score->class,
> > - dst_cpu);
> > + score_on_dst_cpu = arch_get_task_class_score(sgcs->min_class, dst_cpu);
> >
> > - /*
> > - * The simpest case. The single busy CPU in the current group will
> > - * become idle after pulling its current task. The destination CPU is
> > - * idle.
> > - */
> > - if (busy_cpus == 1) {
> > - sgs->task_class_score_before = class_sgs->sum_score;
> > - sgs->task_class_score_after = score_on_dst_cpu;
> > - return;
> > - }
> > + before = sgcs->sum_score
> > + after = before - sgcs->min_score + score_on_dst_cpu;
>
> This works when the sched group being evaluated has only one busy CPU
> because it will become idle if the destination CPU (which was idle) pulls
> the current task.
>
> >
> > - /*
> > - * Now compute the group score with and without the task with the
> > - * lowest score. We assume that the tasks that remain in the group share
> > - * the CPU resources equally.
> > - */
> > - group_score = class_sgs->sum_score / busy_cpus;
> > -
> > - group_score_without = (class_sgs->sum_score - class_sgs->min_score) /
> > - (busy_cpus - 1);
> > + if (busy_cpus > 1) {
> > + before /= busy_cpus;
> > + after /= busy_cpus;
>
>
> However, I don't think this works when the sched group has more than one
> busy CPU. 'before' and 'after' reflect the total throughput score of both
> the sched group *and* the destination CPU.
>
> One of the CPUs in the sched group will become idle after the balance.
>
> Also, at this point we have already added score_on_dst_cpu. We are incorrectly
> scaling it by the number of busy CPUs in the sched group.
>
> We instead must scale 'after' by busy_cpus - 1 and then add score_on_dst_cpu.
So none of that makes sense.
'x/n + y' != '(x+y)/(n+1)'
IOW:
> > + }
> >
> > - sgs->task_class_score_after = group_score_without + score_on_dst_cpu;
> > - sgs->task_class_score_before = group_score;
> > + sgs->task_class_score_before = before;
> > + sgs->task_class_score_after = after;
your task_class_score_after is a sum value for 2 cpus worth, not a value
for a single cpu, while your task_class_score_before is a single cpu
average. You can't compare these numbers and have a sensible outcome.
If you have a number of values: x_1....x_n, their average is
Sum(x_1...x_n) / n, which is a single value again.
If you want to update one of the x's, say x_i->x'_i, then the average
changes like:
Sum(x_1...x_n-x_i+x'_i) / n
If you want to remove one of the x's, then you get:
Sum(x_1...x_n-x_i) / (n-1) ; 0<i<n
if you want to add an x:
Sum(x_1...x_n+i_i) / (n+1) ; i>n
Nowhere would you ever get:
Sum(x_1...x_n) / n + x_i
That's just straight up nonsense.
So I might buy an argument for:
if (busy_cpus > 1) {
before /= (busy_cpus-1);
after /= (busy_cpus+1);
}
Or something along those lines (where you remove an entry from before
and add it to after), but not this.