Re: [PATCH v3] sched/fair: filter out overloaded cpus in SIS

From: Abel Wu
Date: Sat May 07 2022 - 13:53:55 EST


Hi Chen,

On 5/8/22 12:09 AM, Chen Yu Wrote:
in
Hi Abel,
On Fri, May 6, 2022 at 1:21 AM Abel Wu <wuyun.abel@xxxxxxxxxxxxx> wrote:

Try to improve searching efficiency of SIS by filtering out the
overloaded cpus, and as a result the more overloaded the system
is, the less cpus will be searched.

My understanding is that, this patch aims to address the following issue:
What kind of CPUs should the SIS scan.
And we also have another patch[1] from another angle:
How many CPUs should the SIS scan.
I assume the two direction could both help speed up the SIS process, so

Agreed.

I'm curious what the result would be with both patch applied, and I planned
to run your patch on my system too.
The overloaded cpus are tracked through LLC shared domain. To
regulate accesses to the shared data, the update happens mainly
at the tick. But in order to make it more accurate, we also take
the task migrations into consideration during load balancing which
can be quite frequent due to short running workload causing newly-
idle. Since an overloaded runqueue requires at least 2 non-idle
tasks runnable, we could have more faith on the "frequent newly-
idle" case.

Benchmark
=========

Tests are done in an Intel(R) Xeon(R) Platinum 8260 CPU@2.40GHz
machine with 2 NUMA nodes each of which has 24 cores with SMT2
enabled, so 96 CPUs in total.

All of the benchmarks are done inside a normal cpu cgroup in a
Do you have any script that I can leverage to launch the test?

I benchmarked the following configs in mmtests:
config-scheduler-unbound
config-scheduler-schbench
config-network-netperf-stream-unbound
config-network-tbench

more details: https://github.com/gormanm/mmtests

clean environment with cpu turbo disabled.
I would recommend to apply the following patch(queued for 5.19) if
the intel_pstate driver is loaded, because it seems that there is a
utilization calculation
issue when turbo is disabled:

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 846bb3a78788..2216b24b6f84 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1322,6 +1322,7 @@ static ssize_t store_no_turbo(struct kobject *a,
struct kobj_attribute *b,
mutex_unlock(&intel_pstate_limits_lock);

intel_pstate_update_policies();
+ arch_set_max_freq_ratio(global.no_turbo);

mutex_unlock(&intel_pstate_driver_lock);

--
2.25.1
[cut]

Thanks, I will apply it before next rounds of testing.


v3:
- removed sched-idle balance feature and focus on SIS
- take non-CFS tasks into consideration
- several fixes/improvement suggested by Josh Don

v2:
- several optimizations on sched-idle balancing
- ignore asym topos in can_migrate_task
- add more benchmarks including SIS efficiency
- re-organize patch as suggested by Mel

v1: https://lore.kernel.org/lkml/20220217154403.6497-5-wuyun.abel@xxxxxxxxxxxxx/
v2: https://lore.kernel.org/lkml/20220409135104.3733193-1-wuyun.abel@xxxxxxxxxxxxx/

Signed-off-by: Abel Wu <wuyun.abel@xxxxxxxxxxxxx>
---
include/linux/sched/topology.h | 12 ++++++++++
kernel/sched/core.c | 38 ++++++++++++++++++++++++++++++
kernel/sched/fair.c | 43 +++++++++++++++++++++++++++-------
kernel/sched/idle.c | 1 +
kernel/sched/sched.h | 4 ++++
kernel/sched/topology.c | 4 +++-
6 files changed, 92 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 56cffe42abbc..95c7ad1e05b5 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -81,8 +81,20 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+
+ /*
+ * Tracking of the overloaded cpus can be heavy, so start
+ * a new cacheline to avoid false sharing.
+ */
Although we put the following items into different cache line compared to
above ones, is it possible that there is still cache false sharing if
CPU1 is reading nr_overloaded_cpus while
CPU2 is updating overloaded_cpus?

I think it's not false sharing, it's just cache contention. But yes,
it is still possible if the two items mixed with others (by compiler)
in one cacheline, which seems out of our control..

+ atomic_t nr_overloaded_cpus ____cacheline_aligned;
____cacheline_aligned seems to put nr_overloaded_cpus into data section, which
seems to be unnecessary. Would ____cacheline_internodealigned_in_smp
be more lightweight?

I didn't see the difference of the two macros, it would be appreciate
if you can shed some light.

+ unsigned long overloaded_cpus[]; /* Must be last */
};

[cut]

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d4bd299d67ab..79b4ff24faee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6323,7 +6323,9 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
- int i, cpu, idle_cpu = -1, nr = INT_MAX;
+ struct sched_domain_shared *sds = sd->shared;
+ int nr, nro, weight = sd->span_weight;
+ int i, cpu, idle_cpu = -1;
struct rq *this_rq = this_rq();
int this = smp_processor_id();
struct sched_domain *this_sd;
@@ -6333,7 +6335,23 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
if (!this_sd)
return -1;

+ nro = atomic_read(&sds->nr_overloaded_cpus);
+ if (nro == weight)
+ goto out;
+
+ nr = min_t(int, weight, p->nr_cpus_allowed);
+
+ /*
+ * It's unlikely to find an idle cpu if the system is under
+ * heavy pressure, so skip searching to save a few cycles
+ * and relieve cache traffic.
+ */
+ if (weight - nro < (nr >> 4) && !has_idle_core)
+ return -1;
In [1] we used util_avg to check if the domain is overloaded and quit
earlier, since util_avg would be
more stable and contains historic data. But I think nr_running in your
patch could be used as
complementary metric and added to update_idle_cpu_scan() in [1] IMO.
+
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+ if (nro > 1)
+ cpumask_andnot(cpus, cpus, sdo_mask(sds));
If I understand correctly, this is the core of the optimization: SIS
filters out the busy cores. I wonder if it
is possible to save historic h_nr_running/idle_h_nr_running and use
the average value? (like the calculation
of avg_scan_cost).

Yes, I have been already working on that for several days, and
along with some improvement on load balance (group_has_spare).
Ideally we can finally get rid out of the cache issues.



if (sched_feat(SIS_PROP) && !has_idle_core) {
u64 avg_cost, avg_idle, span_avg;
@@ -6354,7 +6372,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
avg_idle = this_rq->wake_avg_idle;
avg_cost = this_sd->avg_scan_cost + 1;

- span_avg = sd->span_weight * avg_idle;
+ span_avg = weight * avg_idle;
if (span_avg > 4*avg_cost)
nr = div_u64(span_avg, avg_cost);
else
[cut]
[1] https://lore.kernel.org/lkml/20220428182442.659294-1-yu.c.chen@xxxxxxxxx/


I followed all 3 versions of your patch, and I think your work makes
sense. My only concern was that the depth is updated every llc_size
milliseconds which could be a long period in large machines and the
load can vary quickly enough to deviate from the historic value. But
it seems not a big deal as we discussed in your v1 patch.

Thanks & BR,
Abel