Re: [RFC PATCH v3 2/3] sched: Introduce cpus_share_l2c

From: Aaron Lu
Date: Wed Sep 06 2023 - 02:39:10 EST


On Tue, Sep 05, 2023 at 08:46:42AM -0400, Mathieu Desnoyers wrote:
> On 9/5/23 03:21, Aaron Lu wrote:
> > Looks like the reduction in task migration is due to SIS_UTIL, i.e.
> > select_idle_cpu() aborts a lot more after applying this series because
> > system utilization increased.
> >
> > Here are some numbers:
> >               @sis      @sic      @migrate_idle_cpu  @abort
> > vanilla:      24640640  15883958  11913588           4148649
> > this_series:  22345434  18597564  4294995            14319284
> >
> > note:
> > - @sis: number of times select_idle_sibling() called;
> > - @sic: number of times select_idle_cpu() called;
> > - @migrate_idle_cpu: number of times task migrated due to
> >   select_idle_cpu() found an idle cpu that is different from prev_cpu;
> > - @abort: number of times select_idle_cpu() aborts the search due to
> >   SIS_UTIL.
> >
> > All numbers are captured during a 5s window while running the below
> > workload on a 2 sockets Intel SPR(56 cores, 112 threads per socket):
> > hackbench -g 20 -f 20 --pipe --threads -l 480000 -s 100
> >
> > So for this workload, I think this series is doing something good: it
> > increased system utilization and due to SIS_UTIL, it also reduced task
> > migration where task migration isn't very useful since system is already
> > overloaded.
>
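To make the shift in the counters above easier to see, here is a small sketch that derives per-call ratios from them. The counter values are copied from the table; the dictionary layout and the `ratios` helper are illustrative names only, not part of any tooling used in this thread.

```python
# Counters copied from the table above (5s window, hackbench -g 20).
counters = {
    "vanilla": {
        "sis": 24640640, "sic": 15883958,
        "migrate_idle_cpu": 11913588, "abort": 4148649,
    },
    "this_series": {
        "sis": 22345434, "sic": 18597564,
        "migrate_idle_cpu": 4294995, "abort": 14319284,
    },
}


def ratios(c):
    """Return derived fractions for one kernel's counters (hypothetical helper)."""
    return {
        # Fraction of select_idle_sibling() calls that reach select_idle_cpu().
        "sic_per_sis": c["sic"] / c["sis"],
        # Fraction of select_idle_cpu() calls aborted due to SIS_UTIL.
        "abort_per_sic": c["abort"] / c["sic"],
        # Migrations to an idle cpu per select_idle_cpu() call.
        "migrate_per_sic": c["migrate_idle_cpu"] / c["sic"],
    }


for name, c in counters.items():
    r = ratios(c)
    print(f"{name:12s} abort/sic={r['abort_per_sic']:.1%} "
          f"migrate/sic={r['migrate_per_sic']:.1%}")
```

With these inputs, the abort ratio goes from roughly 26% on vanilla to roughly 77% with the series, while migrations per select_idle_cpu() call drop from roughly 75% to roughly 23%.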
> This is interesting. Did you also profile the impact of the patches on
> wake_affine(), especially wake_affine_idle() ? Its behavior did change very
> significantly in my tests, and this impacts the target cpu number received
> by select_idle_sibling(). But independently of what wake_affine() returns as
> target (waker cpu or prev_cpu), if select_idle_cpu() is trigger-happy and
> finds idle cores near that target, this will cause lots of migrations.

For the group=20 case, wake_affine() and wake_affine_idle() don't appear to
change much on this Intel machine, in that the target received by sis() is
mostly prev_cpu instead of the waker (this) cpu for both kernels.

But I do notice that for the group=32 case, in the vanilla kernel, the chance
of the target received by sis() being the waker cpu increased a lot, while
with this series the target remains mostly prev_cpu. That is why migration
dropped with this series for the group=32 case: when sis() falls back to
using target, this series has a higher chance of not migrating the task.
My profile also shows that when the vanilla kernel chooses the waker cpu as
target, it's mostly due to wake_affine_weight(), not wake_affine_idle().

Thanks,
Aaron

>
> Based on your metrics, the ttwu-queued-l2 approach (in addition to reducing
> lock contention) appears to decrease the SIS_UTIL idleness level of the cpus
> enough to completely change the runqueue selection and migration behavior.
>
> I fear that we hide a bad scheduler behavior under the rug by changing the
> idleness level of a specific workload pattern, while leaving the underlying
> root cause unfixed.
>
> I'm currently working on a different approach: rate limit migrations.
> Basically, the idea is to detect when a task is migrated too often for its
> own good, and prevent the scheduler from migrating it for a short while. I
> get about 30% performance improvement with this approach as well (limit
> migration to 1 per 2ms window per task). I'll finish polishing my commit
> messages and send a series as RFC soon.
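The rate-limiting idea quoted above could be modelled roughly as below. This is only a user-space sketch of one possible reading (at most one migration per 2ms since the task's last migration), not the actual patch, which has not been posted in this thread. `TaskState`, `try_migrate` and `MIGRATION_WINDOW_NS` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# 2ms window, taken from the figure quoted above.
MIGRATION_WINDOW_NS = 2_000_000


@dataclass
class TaskState:
    # Hypothetical per-task bookkeeping; not a real task_struct field.
    last_migration_ns: Optional[int] = None


def try_migrate(task: TaskState, now_ns: int,
                window_ns: int = MIGRATION_WINDOW_NS) -> bool:
    """Return True and record the migration if it is allowed, else False.

    A migration is refused when the task already migrated less than
    window_ns nanoseconds ago.
    """
    last = task.last_migration_ns
    if last is not None and now_ns - last < window_ns:
        return False
    task.last_migration_ns = now_ns
    return True


task = TaskState()
print(try_migrate(task, 0))          # first migration is allowed
print(try_migrate(task, 1_000_000))  # 1ms later: refused
print(try_migrate(task, 2_000_000))  # 2ms later: allowed again
```

A real implementation would also have to decide what happens to a refused wakeup (for example, staying on prev_cpu), which this sketch leaves out.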
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>