Re: [RFC v2 2/2] sched/fair: introduce sched-idle balance

From: Abel Wu
Date: Tue Apr 12 2022 - 13:56:20 EST


Hi Josh,

On 4/12/22 9:59 AM, Josh Don wrote:
Hi Abel,


+static inline bool cfs_rq_busy(struct rq *rq)
+{
+ return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running == 1;
+}
+
+static inline bool need_pull_cfs_task(struct rq *rq)
+{
+ return rq->cfs.h_nr_running == rq->cfs.idle_h_nr_running;
+}

Note that this will also return true if there are 0 tasks, which I
don't think is the semantics you intend for its use in
rebalance_domains() below.

I intended it to cover the idle balance as well. My v1 patchset tried to
skip the idle balance because of the high cpu wakeup latency, but after
some benchmarking that seems unnecessary.
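
To make the semantics of the two helpers concrete, here is a minimal
user-space sketch (not kernel code). The struct and the model_* names
are made up for illustration; only the two counters and the comparisons
are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two cfs_rq counters used by the helpers. */
struct cfs_counts {
	unsigned int h_nr_running;      /* all runnable CFS tasks in the hierarchy */
	unsigned int idle_h_nr_running; /* the hierarchically idle subset of those */
};

/* Number of runnable non-idle tasks. */
static unsigned int nr_non_idle(const struct cfs_counts *c)
{
	/* The idle count can never exceed the total. */
	assert(c->idle_h_nr_running <= c->h_nr_running);
	return c->h_nr_running - c->idle_h_nr_running;
}

/* Mirrors cfs_rq_busy(): exactly one non-idle task is runnable. */
static bool model_cfs_rq_busy(const struct cfs_counts *c)
{
	return nr_non_idle(c) == 1;
}

/*
 * Mirrors need_pull_cfs_task(): no non-idle task is runnable. This is
 * true both for an rq holding only idle tasks and for an empty rq.
 */
static bool model_need_pull_cfs_task(const struct cfs_counts *c)
{
	return nr_non_idle(c) == 0;
}
```

So an rq with { 0, 0 } and an rq with { 3, 3 } both want to pull, while
an rq with { 101, 100 } is "busy" and does not.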


/*
* Use locality-friendly rq->overloaded to cache the status of the rq
* to minimize the heavy cost on LLC shared data.
@@ -7837,6 +7867,22 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if (kthread_is_per_cpu(p))
return 0;

+ if (unlikely(task_h_idle(p))) {
+ /*
+ * Disregard hierarchically idle tasks during sched-idle
+ * load balancing.
+ */
+ if (env->idle == CPU_SCHED_IDLE)
+ return 0;
+ } else if (!static_branch_unlikely(&sched_asym_cpucapacity)) {
+ /*
+ * It's not gonna help if stacking non-idle tasks on one
+ * cpu while leaving some idle.
+ */
+ if (cfs_rq_busy(env->src_rq) && !need_pull_cfs_task(env->dst_rq))
+ return 0;

These checks don't involve the task at all, so this kind of check
should be pushed into the more general load balance function. But, I'm
not totally clear on the motivation here. If we have cpu A with 1
non-idle task and 100 idle tasks, and cpu B with 1 non-idle task, we
should definitely try to load balance some of the idle tasks from A to
B. idle tasks _do_ get time to run (although little), and this can add
up and cause antagonism to the non-idle task if there are a lot of
idle threads.

CPU_SCHED_IDLE means the balance was triggered by sched_idle_balance(),
which pulls a non-idle task for the unoccupied cpu from the overloaded
ones, so idle tasks are not the target and should be skipped.

The second part is: if we have cpu A with 1 non-idle task and 100 idle
tasks, and B with >=1 non-idle task, we don't migrate the last non-idle
task on A to B.
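
The two rules can be written down as a small user-space model (not
kernel code). All names below are hypothetical, and the sketch assumes
symmetric cpu capacity, i.e. it leaves out the sched_asym_cpucapacity
branch of the hunk.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: who triggered the balance. */
enum balance_kind { BALANCE_OTHER, BALANCE_SCHED_IDLE };

/* Hypothetical per-rq counters: total runnable tasks and the idle subset. */
struct rq_counts {
	unsigned int nr_total;
	unsigned int nr_idle;
};

static unsigned int non_idle(const struct rq_counts *rq)
{
	assert(rq->nr_idle <= rq->nr_total);
	return rq->nr_total - rq->nr_idle;
}

/* Returns true if this toy filter would let the task move from src to dst. */
static bool may_migrate(bool task_is_idle, enum balance_kind kind,
			const struct rq_counts *src,
			const struct rq_counts *dst)
{
	/* Rule 1: sched-idle balance only targets non-idle tasks. */
	if (task_is_idle)
		return kind != BALANCE_SCHED_IDLE;

	/*
	 * Rule 2: do not move the last non-idle task of src onto a dst
	 * that already has a non-idle task of its own.
	 */
	if (non_idle(src) == 1 && non_idle(dst) != 0)
		return false;

	return true;
}
```

With A = { 101, 100 } and B = { 1, 0 }, the non-idle task on A stays
where it is, but idle tasks on A can still be moved to B by balance
passes other than sched-idle balance.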



/*
+ * The sched-idle balancing tries to make full use of cpu capacity
+ * for non-idle tasks by pulling them for the unoccupied cpus from
+ * the overloaded ones.
+ *
+ * Return 1 if pulled successfully, 0 otherwise.
+ */
+static int sched_idle_balance(struct rq *dst_rq)
+{
+ struct sched_domain *sd;
+ struct task_struct *p = NULL;
+ int dst_cpu = cpu_of(dst_rq), cpu;
+
+ sd = rcu_dereference(per_cpu(sd_llc, dst_cpu));
+ if (unlikely(!sd))
+ return 0;
+
+ if (!atomic_read(&sd->shared->nr_overloaded))
+ return 0;
+
+ for_each_cpu_wrap(cpu, sdo_mask(sd->shared), dst_cpu + 1) {
+ struct rq *rq = cpu_rq(cpu);
+ struct rq_flags rf;
+ struct lb_env env;
+
+ if (cpu == dst_cpu || !cfs_rq_overloaded(rq) ||
+ READ_ONCE(rq->sched_idle_balance))
+ continue;
+
+ WRITE_ONCE(rq->sched_idle_balance, 1);
+ rq_lock_irqsave(rq, &rf);
+
+ env = (struct lb_env) {
+ .sd = sd,
+ .dst_cpu = dst_cpu,
+ .dst_rq = dst_rq,
+ .src_cpu = cpu,
+ .src_rq = rq,
+ .idle = CPU_SCHED_IDLE, /* non-idle only */
+ .flags = LBF_DST_PINNED, /* pin dst_cpu */
+ };
+
+ update_rq_clock(rq);
+ p = detach_one_task(&env);
+ if (p)
+ update_overload_status(rq);
+
+ rq_unlock(rq, &rf);
+ WRITE_ONCE(rq->sched_idle_balance, 0);
+
+ if (p) {
+ attach_one_task(dst_rq, p);
+ local_irq_restore(rf.flags);
+ return 1;
+ }
+
+ local_irq_restore(rf.flags);
+ }
+
+ return 0;
+}

I think this could probably be integrated with the load balancing
function. Your goal is to ignore idle tasks for the purpose of pulling
from a remote rq. And I think the above isn't exactly what you want
anyway; detach_tasks/detach_one_task are just going to iterate the
task list in order. You want to actually look for the non-idle tasks
explicitly.

I have tried a simple version like below (and sched_idle_balance() is
not needed anymore):

@@ -10338,6 +10343,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
int continue_balancing = 1;
int cpu = rq->cpu;
int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
+ int prev_busy = busy;
unsigned long interval;
struct sched_domain *sd;
/* Earliest time when we have to do rebalance again */
@@ -10394,6 +10400,9 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
next_balance = sd->last_balance + interval;
update_next_balance = 1;
}
+
+ if (!prev_busy && !need_pull_cfs_task(rq))
+ break;
}
if (need_decay) {
/*

But the benchmark results are not as good as with the RFC v2 patchset.
I will dig deeper into this, thanks.


@@ -10996,9 +11119,9 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)

if (sd->flags & SD_BALANCE_NEWIDLE) {

- pulled_task = load_balance(this_cpu, this_rq,
- sd, CPU_NEWLY_IDLE,
- &continue_balancing);
+ pulled_task |= load_balance(this_cpu, this_rq,
+ sd, CPU_NEWLY_IDLE,
+ &continue_balancing);

Why |= ?

This is because I changed the behavior of newidle balance a bit. The
vanilla kernel quits newidle balance once we get a task to run on this
rq, no matter whether the task is non-idle or not. But after this patch,
if there are overloaded cpus in this LLC, we will try harder to balance
until we get non-idle tasks, which means the balancing will continue
even if the cpu is now sched_idle.
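
A toy user-space model of why the accumulation matters (not the actual
loop in newidle_balance()). The struct and function names are
hypothetical, and the model simply assumes the LLC has overloaded cpus
for the whole walk. Once the walk may continue past a level that pulled
only idle tasks, a plain assignment at a later level that pulls nothing
would lose the earlier success.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of one load_balance() attempt at one domain level. */
struct attempt {
	int pulled;        /* number of tasks pulled at this level */
	bool got_non_idle; /* whether any pulled task was non-idle */
};

/* Vanilla-like walk: stop at the first level that pulls anything. */
static size_t walk_vanilla(const struct attempt *a, size_t n, int *pulled_task)
{
	size_t i;

	*pulled_task = 0;
	for (i = 0; i < n; i++) {
		*pulled_task = a[i].pulled;
		if (*pulled_task)
			return i + 1; /* levels visited */
	}
	return n;
}

/*
 * Patched-like walk: keep going until a non-idle task arrives. The |=
 * keeps pulled_task non-zero once any level has pulled something.
 */
static size_t walk_patched(const struct attempt *a, size_t n, int *pulled_task)
{
	size_t i;

	*pulled_task = 0;
	for (i = 0; i < n; i++) {
		*pulled_task |= a[i].pulled;
		if (a[i].got_non_idle)
			return i + 1; /* levels visited */
	}
	return n;
}
```

For the sequence { pulled idle task, pulled nothing, pulled non-idle
task } the vanilla-like walk stops after the first level, while the
patched-like walk visits all three and still reports a pull.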

Thanks & BR,
Abel