Re: [PATCH] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

From: Vincent Guittot
Date: Thu Jul 02 2020 - 09:45:26 EST


On Thu, 2 Jul 2020 at 15:29, Mel Gorman <mgorman@xxxxxxx> wrote:
>
> On Thu, Jul 02, 2020 at 11:27:52AM +0200, Dietmar Eggemann wrote:
> > On 17/06/2020 16:52, Peter Puhov wrote:
> > > On Wed, 17 Jun 2020 at 06:50, Valentin Schneider
> > > <valentin.schneider@xxxxxxx> wrote:
> > >>
> > >>
> > >> On 16/06/20 17:48, peter.puhov@xxxxxxxxxx wrote:
> > >>> From: Peter Puhov <peter.puhov@xxxxxxxxxx>
> > >>> We tested this patch with following benchmarks:
> > >>> perf bench -f simple sched pipe -l 4000000
> > >>> perf bench -f simple sched messaging -l 30000
> > >>> perf bench -f simple mem memset -s 3GB -l 15 -f default
> > >>> perf bench -f simple futex wake -s -t 640 -w 1
> > >>> sysbench cpu --threads=8 --cpu-max-prime=10000 run
> > >>> sysbench memory --memory-access-mode=rnd --threads=8 run
> > >>> sysbench threads --threads=8 run
> > >>> sysbench mutex --mutex-num=1 --threads=8 run
> > >>> hackbench --loops 20000
> > >>> hackbench --pipe --threads --loops 20000
> > >>> hackbench --pipe --threads --loops 20000 --datasize 4096
> > >>>
> > >>> and found some performance improvements in:
> > >>> sysbench threads
> > >>> sysbench mutex
> > >>> perf bench futex wake
> > >>> and no regressions in others.
> > >>>
> > >>
> > >> One nitpick for the results of those: condensing them in a table form would
> > >> make them more reader-friendly. Perhaps something like:
> > >>
> > >> | Benchmark | Metric | Lower is better? | BASELINE | SERIES | DELTA |
> > >> |------------------+----------+------------------+----------+--------+-------|
> > >> | Sysbench threads | # events | No | 45526 | 56567 | +24% |
> > >> | Sysbench mutex | ... | | | | |
> > >>
> > >> If you want to include more stats for each benchmark, you could have one table
> > >> per (e.g. see [1]) - it'd still be a more readable form (or so I believe).
> >
> > Wouldn't Unix Bench's 'execl' and 'spawn' be the ultimate test cases
> > for those kind of changes?
> >
> > I only see minor improvements with tip/sched/core as base on hikey620
> > (Arm64 octa-core).
> >
> > base w/ patch
> > ./Run spawn -c 8 -i 10 633.6 635.1
> >
> > ./Run execl -c 8 -i 10 1187.5 1190.7
> >
> >
> > At the end of find_idlest_group(), when comparing local and idlest, it
> > is explicitly mentioned that number of idle_cpus is used instead of
> > utilization.
> > The comparision between potential idle groups and local & idlest group
> > should probably follow the same rules.

Comparing the number of idle cpu is not enough in the case described
by Peter because the newly forked thread sleeps immediately and before
we select cpu for the next one. This is shown in the trace where the
same CPU7 is selected for all wakeup_new events.
That's why, looking at utilization when there is the same number of
CPU is a good way to see where the previous task was placed. Using
nr_running doesn't solve the problem because newly forked task is not
running and the cpu would not have been idle in this case and an idle
CPU would have been selected instead

> >
>
> There is the secondary hazard that what update_sd_pick_busiest returns
> is checked later by find_busiest_group when considering the imbalance
> between NUMA nodes. This particular patch splits basic communicating tasks
> cross-node again at fork time so cross node communication is reintroduced
> (same applies if sum_nr_running is used instead of group_util).

As long as there is an idle cpu in the node, new thread doesn't cross
node like previously. The only difference happens inside the node

>
> --
> Mel Gorman
> SUSE Labs