Re: [RFC PATCH v3 2/3] sched: Introduce cpus_share_l2c

From: Mathieu Desnoyers
Date: Tue Sep 05 2023 - 11:59:57 EST


On 9/5/23 03:21, Aaron Lu wrote:
On Fri, Sep 01, 2023 at 09:45:28PM +0800, Aaron Lu wrote:
On Mon, Aug 28, 2023 at 07:19:45PM +0800, Aaron Lu wrote:
On Fri, Aug 25, 2023 at 09:51:19AM -0400, Mathieu Desnoyers wrote:
On 8/25/23 02:49, Aaron Lu wrote:
On Thu, Aug 24, 2023 at 10:40:45AM -0400, Mathieu Desnoyers wrote:
[...]
- task migrations dropped with this series for nr_group=20 and 32
according to 'perf stat'. Migration numbers didn't drop for nr_group=10,
but the two update functions' cost dropped, which implies fewer accesses
to tg->load_avg and thus fewer task migrations. This is contradictory
and I cannot explain it yet;

Neither can I.


[...]


It's not clear to me why this series can reduce task migrations. I suspect
it has something to do with more wakelist style wakeups, because on this
test machine only a single core with two SMT threads shares an L2, so more
wakeups go through the wakelist. In a wakelist style wakeup, the target
rq's ttwu_pending is set and that makes the target cpu appear as
!idle_cpu(); this is faster than grabbing the target rq's lock and then
increasing the target rq's nr_running or setting the target rq's curr to
something other than idle. So wakelist style wakeups can make the target
cpu appear non-idle sooner, but I can't connect this with reduced
migrations yet; I just feel this might be the reason why task migrations
dropped.
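
For reference, the "appears non-idle sooner" behavior comes from the
ttwu_pending check in idle_cpu(). Roughly paraphrased from
kernel/sched/core.c (details may differ slightly between versions); the
wakelist path does WRITE_ONCE(rq->ttwu_pending, 1) before sending the
IPI, so the check below fires before the remote cpu has even taken its
rq lock:

int idle_cpu(int cpu)
{
        struct rq *rq = cpu_rq(cpu);

        if (rq->curr != rq->idle)
                return 0;

        if (rq->nr_running)
                return 0;

#ifdef CONFIG_SMP
        /* A queued (wakelist) wakeup marks the cpu busy here long
         * before nr_running is bumped or curr is switched. */
        if (rq->ttwu_pending)
                return 0;
#endif

        return 1;
}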


[...]
I've tried adding checks for rq->ttwu_pending in those code paths on top of
my patch and I'm still observing the reduction in number of migrations, so
it's unclear to me how doing more queued wakeups can reduce migrations the
way it does.

An interesting puzzle.

One metric that can help understand the impact of my patch: comparing
hackbench on a baseline where only your load_avg patch is applied
against a kernel with my l2c patch applied, I notice that the goidle
schedstat is cut in half. For a given CPU (they are all pretty much
alike), it goes from 650456 to 353487.

So could it be that by doing queued wakeups, we end up batching
execution of the woken up tasks for a given CPU, rather than going
back and forth between idle and non-idle? One important thing this
changes is that it reduces the number of newidle balances triggered.

I noticed the majority (>99%) of migrations are from the wakeup path on
this Intel SPR when running hackbench: ttwu() -> set_task_cpu() ->
migrate_task_rq_fair(), so while I think it's a good finding that
newidle balance dropped, it's probably not the reason why the migration
number dropped...
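
(For context, migrate_task_rq_fair() is reached via the sched_class hook
whenever the wakeup picks a cpu different from the task's previous one;
a simplified paraphrase of kernel/sched/core.c:set_task_cpu(), leaving
out the debug, rseq and perf bits:)

void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
{
        if (task_cpu(p) != new_cpu) {
                /* For CFS tasks this is migrate_task_rq_fair(). */
                if (p->sched_class->migrate_task_rq)
                        p->sched_class->migrate_task_rq(p, new_cpu);
                p->se.nr_migrations++;
        }

        __set_task_cpu(p, new_cpu);
}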

I profiled select_idle_sibling() and found that with this series,
select_idle_cpu() tends to fail more often and select_idle_sibling()
falls back to using target in the end, which very often equals prev_cpu.

Initially I thought the reason select_idle_cpu() fails more with this
series is that "wakelist style enqueue" can make the target cpu appear
busy earlier and thus make it harder for select_idle_cpu() to find an
idle cpu overall. But I also suspect SIS_UTIL makes a difference here:
with the vanilla kernel the idle% is 8%, while with this series the
idle% is only 2%, so SIS_UTIL may simply skip doing any search for an
idle cpu. Anyway, I think I'll also need to profile select_idle_cpu()
to see what's going on there too.
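
(For reference, the SIS_UTIL bail-out in question is the early return
near the top of select_idle_cpu(); the fragment below is roughly
paraphrased from kernel/sched/fair.c, where sd_share and nr are locals
of select_idle_cpu() and nr_idle_scan is scaled down by periodic load
balancing as LLC utilization grows:)

        if (sched_feat(SIS_UTIL)) {
                sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
                if (sd_share) {
                        /* because !--nr is the condition to stop the scan */
                        nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
                        /* an overloaded LLC is unlikely to have an idle cpu/core */
                        if (nr == 1)
                                return -1;
                }
        }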

Looks like the reduction in task migrations is due to SIS_UTIL, i.e.
select_idle_cpu() aborts a lot more often after applying this series
because system utilization increased.

Here are some numbers:
              @sis        @sic        @migrate_idle_cpu  @abort
vanilla:      24640640    15883958    11913588           4148649
this_series:  22345434    18597564    4294995            14319284

note:
- @sis: number of times select_idle_sibling() called;
- @sic: number of times select_idle_cpu() called;
- @migrate_idle_cpu: number of times a task migrated because
select_idle_cpu() found an idle cpu different from prev_cpu;
- @abort: number of times select_idle_cpu() aborted the search due to
SIS_UTIL.

All numbers are captured during a 5s window while running the below
workload on a 2-socket Intel SPR (56 cores, 112 threads per socket):
hackbench -g 20 -f 20 --pipe --threads -l 480000 -s 100

So for this workload, I think this series is doing something good: it
increased system utilization and, due to SIS_UTIL, it also reduced task
migrations in a situation where migration isn't very useful since the
system is already overloaded.

This is interesting. Did you also profile the impact of the patches on wake_affine(), especially wake_affine_idle() ? Its behavior did change very significantly in my tests, and this impacts the target cpu number received by select_idle_sibling(). But independently of what wake_affine() returns as target (waker cpu or prev_cpu), if select_idle_cpu() is trigger-happy and finds idle cores near that target, this will cause lots of migrations.
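
For reference, wake_affine_idle() picks between the waker's cpu and
prev_cpu almost entirely from their (available-)idleness, which is why a
change in how early a cpu stops looking idle can flip its answer.
Roughly paraphrased from kernel/sched/fair.c (details vary by version):

static int wake_affine_idle(int this_cpu, int prev_cpu, int sync)
{
        /*
         * Waker cpu idle (e.g. interrupt wakeup): only pull the task if
         * caches are shared, and prefer an idle prev_cpu over migrating.
         */
        if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
                return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;

        if (sync && cpu_rq(this_cpu)->nr_running == 1)
                return this_cpu;

        if (available_idle_cpu(prev_cpu))
                return prev_cpu;

        /* No preference; wake_affine() falls back to wake_affine_weight(). */
        return nr_cpumask_bits;
}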

Based on your metrics, the ttwu-queued-l2 approach (in addition to reducing lock contention) appears to decrease the SIS_UTIL idleness level of the cpus enough to completely change the runqueue selection and migration behavior.

I fear that we are hiding a bad scheduler behavior under the rug by changing the idleness level of a specific workload pattern, while leaving the underlying root cause unfixed.

I'm currently working on a different approach: rate limiting migrations. Basically, the idea is to detect when a task is migrated too often for its own good, and to prevent the scheduler from migrating it for a short while. I get about 30% performance improvement with this approach as well (limiting migrations to 1 per 2 ms window per task). I'll finish polishing my commit messages and send a series as RFC soon.
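
To illustrate the general shape of that idea (purely a hypothetical
sketch with made-up names, not the actual RFC series): assuming a
per-task timestamp such as p->last_migration is added, the check could
look roughly like this, called from the migration decision path:

#define MIGRATION_WINDOW_NS     (2 * NSEC_PER_MSEC)

/* Hypothetical helper: allow at most one migration per 2ms window. */
static inline bool task_migration_allowed(struct task_struct *p, u64 now)
{
        if (now - p->last_migration < MIGRATION_WINDOW_NS)
                return false;

        p->last_migration = now;
        return true;
}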

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com