Re: [LKP] [sched/fair] 6c8116c914: stress-ng.mmapfork.ops_per_sec -38.0% regression

From: Vincent Guittot
Date: Tue Jun 30 2020 - 10:22:27 EST


Hi Tao,

On Tue, 30 Jun 2020 at 11:41, Tao Zhou <ouwen210@xxxxxxxxxxx> wrote:
>
> Hi,
>
> On Tue, Jun 30, 2020 at 09:43:11AM +0200, Vincent Guittot wrote:
> > Hi Tao,
> >
> > On Monday, 15 June 2020 at 16:14:01 (+0800), Xing Zhengjun wrote:
> > >
> > >
> > > On 6/15/2020 1:18 PM, Tao Zhou wrote:
> >
> > ...
> >
> > > I applied the patch on top of v5.7, and the regression still exists.
> >
> >
> > Could you try the patch below? This patch is not a real fix, because it hurts the performance of other benchmarks, but it will at least help narrow down your problem.
> >
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 9f78eb76f6fb..a4d8614b1854 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8915,9 +8915,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> > * and consider staying local.
> > */
> >
> > - if ((sd->flags & SD_NUMA) &&
> > - ((idlest_sgs.avg_load + imbalance) >= local_sgs.avg_load))
> > - return NULL;
> > +// if ((sd->flags & SD_NUMA) &&
> > +// ((idlest_sgs.avg_load + imbalance) >= local_sgs.avg_load))
> > +// return NULL;
>
> Just narrowing it down to the fork (wakeup) path that may be causing the problem, /me thinks.

The perf regression seems to be fixed with this patch on my setup.
According to the statistics that I have for this use case, the groups are
overloaded but the load is quite low, and this low load level hits this
NUMA-specific condition.
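
To make that concrete, here is a minimal standalone sketch (plain userspace C with made-up load numbers, not the kernel structs) of how that NUMA check behaves: when the absolute load is low, the allowed imbalance term easily dominates, so find_idlest_group() returns NULL and the forked task stays on the local node.

/*
 * Hypothetical illustration only (not kernel code): mirrors the condition
 *   (idlest_sgs.avg_load + imbalance) >= local_sgs.avg_load
 * from find_idlest_group().
 */
#include <stdbool.h>
#include <stdio.h>

static bool stays_local_on_numa(unsigned long local_avg_load,
				unsigned long idlest_avg_load,
				unsigned long imbalance)
{
	return (idlest_avg_load + imbalance) >= local_avg_load;
}

int main(void)
{
	/* Low load: a 20% load difference is masked by the imbalance term. */
	printf("%d\n", stays_local_on_numa(120, 100, 25));	/* 1: stays local */
	/* High load: the same imbalance no longer masks the difference. */
	printf("%d\n", stays_local_on_numa(1200, 1000, 25));	/* 0: spreads out */
	return 0;
}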

> Some days ago, I tried this patch:
>
> https://lore.kernel.org/lkml/20200616164801.18644-1-peter.puhov@xxxxxxxxxx/
>
> ---
> kernel/sched/fair.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 02f323b85b6d..abcbdf80ee75 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8662,8 +8662,14 @@ static bool update_pick_idlest(struct sched_group *idlest,
>
> case group_has_spare:
> /* Select group with most idle CPUs */
> - if (idlest_sgs->idle_cpus >= sgs->idle_cpus)
> + if (idlest_sgs->idle_cpus > sgs->idle_cpus)
> return false;
> +
> + /* Select group with lowest group_util */
> + if (idlest_sgs->idle_cpus == sgs->idle_cpus &&
> + idlest_sgs->group_util <= sgs->group_util)
> + return false;
> +
> break;
> }
>
> --
>
> This patch is related to the wakeup slow path, in the group_has_spare case.

I tried it but haven't seen any impact on the mmapfork test results.
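
For readers following the thread, here is a minimal standalone sketch (plain C with simplified stand-in types, not the kernel's sg_lb_stats) of the tie-break that the patch above adds to update_pick_idlest() for the group_has_spare case: the group with the most idle CPUs still wins, and on a tie the group with the lower utilization is preferred.

#include <stdbool.h>

struct grp_stats {
	unsigned int idle_cpus;
	unsigned long group_util;
};

/* Return true when candidate 'sgs' should NOT replace the current idlest. */
static bool keep_current_idlest(const struct grp_stats *idlest,
				const struct grp_stats *sgs)
{
	/* Current winner has strictly more idle CPUs: keep it. */
	if (idlest->idle_cpus > sgs->idle_cpus)
		return true;

	/*
	 * Tie on idle CPUs: keep the current winner unless the candidate
	 * has strictly lower utilization.
	 */
	if (idlest->idle_cpus == sgs->idle_cpus &&
	    idlest->group_util <= sgs->group_util)
		return true;

	return false;	/* candidate becomes the new idlest group */
}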

> What I tried that showed an improvement:
>
> $> sysbench threads --threads=16 run
>
> Total number of events (higher is better):
>
> v5.8-rc1 : 34020 34494 33561
> v5.8-rc1+patch: 35466 36184 36260
>
> $> perf bench -f simple sched pipe -l 4000000
>
> Total run time in seconds (lower is better):
>
> v5.8-rc1      : 16.203 16.238 16.150
> v5.8-rc1+patch: 15.757 15.930 15.819
>
> I also saw some regressions on other workloads (I don't know much about them).
> So I suggest testing this patch with stress-ng.mmapfork; I haven't done that
> yet.
>
> Another patch I want to mention here is this one (now merged):
>
> commit 68f7b5cc83 ("sched/cfs: change initial value of runnable_avg")
>
> And this regression was reported against v5.7. This patch is related to the fork
> wakeup path for the overloaded group type, so it is definitely worth trying.
>
> Finally, I have a patch that does not seem related to the fork wakeup path;
> I tried it on some benchmarks but did not see an improvement there.
> I include this modified patch here anyway (I think the fully_busy type should
> consider idle CPUs first, but I'm not sure). It may not be worth trying.
>
> Index: core.bak/kernel/sched/fair.c
> ===================================================================
> --- core.bak.orig/kernel/sched/fair.c
> +++ core.bak/kernel/sched/fair.c
> @@ -9226,17 +9226,20 @@ static struct sched_group *find_busiest_
> goto out_balanced;
>
> if (busiest->group_weight > 1 &&
> - local->idle_cpus <= (busiest->idle_cpus + 1))
> - /*
> - * If the busiest group is not overloaded
> - * and there is no imbalance between this and busiest
> - * group wrt idle CPUs, it is balanced. The imbalance
> - * becomes significant if the diff is greater than 1
> - * otherwise we might end up to just move the imbalance
> - * on another group. Of course this applies only if
> - * there is more than 1 CPU per group.
> - */
> - goto out_balanced;
> + local->idle_cpus <= (busiest->idle_cpus + 1)) {
> + if (local->group_type == group_has_spare) {
> + /*
> + * If the busiest group is not overloaded
> + * and there is no imbalance between this and busiest
> + * group wrt idle CPUs, it is balanced. The imbalance
> + * becomes significant if the diff is greater than 1
> + * otherwise we might end up to just move the imbalance
> + * on another group. Of course this applies only if
> + * there is more than 1 CPU per group.
> + */
> + goto out_balanced;
> + }
> + }
>
> if (busiest->sum_h_nr_running == 1)
> /*
>
>
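
As an aside, a minimal sketch (plain C, hypothetical names rather than the kernel's enum and structs) of what that change to find_busiest_group() does: the "idle CPUs are roughly balanced, bail out" shortcut is only taken when the local group actually has spare capacity.

#include <stdbool.h>

enum grp_type { GROUP_HAS_SPARE, GROUP_FULLY_BUSY, GROUP_OVERLOADED };

struct grp_stats {
	enum grp_type group_type;
	unsigned int group_weight;
	unsigned int idle_cpus;
};

/* With the patch, all three conditions must hold before declaring balance. */
static bool declare_balanced(const struct grp_stats *local,
			     const struct grp_stats *busiest)
{
	return busiest->group_weight > 1 &&
	       local->idle_cpus <= busiest->idle_cpus + 1 &&
	       local->group_type == GROUP_HAS_SPARE;
}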
> TBH, I don't know much about the numbers below.
>
> Thank you for the help!
>
> Thanks.
>
> > /*
> > * If the local group is less loaded than the selected
> >
> > --
> >
> >
> > > =========================================================================================
> > > tbox_group/testcase/rootfs/kconfig/compiler/nr_threads/disk/sc_pid_max/testtime/class/cpufreq_governor/ucode:
> > >
> > > lkp-bdw-ep6/stress-ng/debian-x86_64-20191114.cgz/x86_64-rhel-7.6/gcc-7/100%/1HDD/4194304/1s/scheduler/performance/0xb000038
> > >
> > > commit:
> > > e94f80f6c49020008e6fa0f3d4b806b8595d17d8
> > > 6c8116c914b65be5e4d6f66d69c8142eb0648c22
> > > v5.7
> > > c7e6d37f60da32f808140b1b7dabcc3cde73c4cc (Tao's patch)
> > >
> > > e94f80f6c4902000 6c8116c914b65be5e4d6f66d69c                        v5.7 c7e6d37f60da32f808140b1b7da
> > > ---------------- --------------------------- --------------------------- ---------------------------
> > >          %stddev     %change         %stddev     %change         %stddev     %change         %stddev
> > >              \          |                \          |                \          |                \
> > >     819250 ±  5%     -10.1%     736616 ±  8%     +41.2%    1156877 ±  3%     +43.6%    1176246 ±  5%  stress-ng.futex.ops
> > >     818985 ±  5%     -10.1%     736460 ±  8%     +41.2%    1156215 ±  3%     +43.6%    1176055 ±  5%  stress-ng.futex.ops_per_sec
> > >       1551 ±  3%      -3.4%       1498 ±  5%      -4.6%       1480 ±  5%     -14.3%       1329 ± 11%  stress-ng.inotify.ops
> > >       1547 ±  3%      -3.5%       1492 ±  5%      -4.8%       1472 ±  5%     -14.3%       1326 ± 11%  stress-ng.inotify.ops_per_sec
> > >      11292 ±  8%      -2.8%      10974 ±  8%      -9.4%      10225 ±  6%     -10.1%      10146 ±  6%  stress-ng.kill.ops
> > >      11317 ±  8%      -2.6%      11023 ±  8%      -9.1%      10285 ±  5%     -10.3%      10154 ±  6%  stress-ng.kill.ops_per_sec
> > >      28.20 ±  4%     -35.4%      18.22           -33.4%      18.77           -27.7%      20.40 ±  9%  stress-ng.mmapfork.ops_per_sec
> > >    2999012 ± 21%     -10.1%    2696954 ± 22%     -88.5%     344447 ± 11%     -87.8%     364932        stress-ng.tee.ops_per_sec
> > >       7882 ±  3%      -5.4%       7458 ±  4%      -2.0%       7724 ±  3%      -2.2%       7709 ±  4%  stress-ng.vforkmany.ops
> > >       7804 ±  3%      -5.2%       7400 ±  4%      -2.0%       7647 ±  3%      -2.1%       7636 ±  4%  stress-ng.vforkmany.ops_per_sec
> > >   46745421 ±  3%      -8.1%   42938569 ±  3%      -5.2%   44312072 ±  4%      -2.3%   45648193        stress-ng.yield.ops
> > >   46734472 ±  3%      -8.1%   42926316 ±  3%      -5.2%   44290338 ±  4%      -2.4%   45627571        stress-ng.yield.ops_per_sec
> > >
> > >
> > >
> >
> > ...
> >
> > > --
> > > Zhengjun Xing