Re: [RFC v2 1/2] sched/fair: filter out overloaded cpus in SIS

From: Abel Wu
Date: Thu Apr 14 2022 - 12:07:47 EST


On 4/14/22 7:49 AM, Josh Don Wrote:
On Tue, Apr 12, 2022 at 10:55 AM Abel Wu <wuyun.abel@xxxxxxxxxxxxx> wrote:

On 4/12/22 9:23 AM, Josh Don Wrote:

static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
- int i, cpu, idle_cpu = -1, nr = INT_MAX;
+ int i, cpu, idle_cpu = -1, nr = INT_MAX, nro;
struct rq *this_rq = this_rq();
int this = smp_processor_id();
struct sched_domain *this_sd;
@@ -6301,7 +6310,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
if (!this_sd)
return -1;

+ nro = atomic_read(&sd->shared->nr_overloaded);
+ if (sched_domain_overloaded(sd, nro))
+ return -1;

This early bail out doesn't seem to be related to the main idea of
your patch. Apart from deciding the exact heuristic value for what is

I agree that this early check doesn't seem to have a strong bound with
the idea "filter out the overloaded cpus", but this check is aligned
with the goal of "search less when becoming more overloaded".

As to the heuristic value, which is about 95%, I think it would be nice
if I can show more test results? I also have tested sd->imbalance_pct
and 100% (nro == sd->span_weight), seems like 95% is a better choice.

considered too unlikely to find an idle cpu, this doesn't work well
with tasks constrained by affinity; a task may have a small affinity
mask containing idle cpus it may wake onto, even if most of the node
is overloaded.

Yes, indeed. And I haven't come to a solution except that remove this
check entirely. Ideas?

Does this check help that much? Given that you added the filter below
to cut out searching overloaded cpus, I would think that the below is
sufficient.

I see a ~10% performance drop in the higher load part of the hackbench
and tbench without this check, in which cases system is quite overloaded
and idle cpus can hardly exist.


Another use case that would break with the above:

A few cpus are reserved for a job, so that it always has a couple cpus
dedicated to it. It can run across the entire machine though (no
affinity restriction). If the rest of the machine is very busy, we'd
still want to be able to search for and find the idle reserved cpus
for the job.

Yes, this could be true if very few cpus are reserved for the job. Along
with the previous affinity case, I think the following might help both:

static inline bool
sched_domain_overloaded(struct sched_domain *sd, int nr_overloaded)
{
return nr_overloaded == sd->span_weight;
}

Besides, I think sched_idle_balance() will work well on this case.


cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+ if (nro)
+ cpumask_andnot(cpus, cpus, sdo_mask(sd->shared));

To prevent us from exhausting our search attempts too quickly, this
only needs to go under the sched_feat(SIS_PROP) && !has_idle_core case
below. But by doing this unconditionally here, I guess your secondary
goal is to reduce total search cost in both cases. Just wondering, did

Yes, it's unnecessary to try the overloaded cpus. But this makes sense
only if the overloaded cpumask is relatively accurate as you pointed
out below.

you observe significant time spent here that you are trying to
optimize? By reducing our search space by the overload mask, it is
important that the mask is relatively up to date, or else we could
miss an opportunity to find an idle cpu.

I think that's why Mel asked for the SIS statistics. The result in the
cover letter shows improvement on the search efficiency, and that is
what the overhead of the cpumask calculation trade for. Would it be
better if skip the update when nro is small?

Just pointing out that with a very fast wake/sleep rate, you could hit
cases where you potentially fail to consider waking onto a cpu that is
actually idle. But I think this concern is addressed below.


if (sched_feat(SIS_PROP) && !has_idle_core) {
u64 avg_cost, avg_idle, span_avg;
@@ -7018,6 +7033,51 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)

return newidle_balance(rq, rf) != 0;
}
+
+static inline bool cfs_rq_overloaded(struct rq *rq)
+{
+ return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running > 1;
+}

Why > 1 instead of > 0? If a cpu is running 1 non-idle task and any
number of idle tasks, I'd think it is still "occupied" in the way
you've defined. We'd want to steer wakeups to cpus running 0 non-idle
tasks.

The idea behind "> 1" is telling the unoccupied cpus to pull non-idle
tasks from it (in the next patch). Although "> 0" is more efficient in
wakeup, it blinds us when pulling tasks.

Ok, I was a bit confused because there are different considerations
for >=1 and >=2 non-idle tasks.

So you consider >= 2 non-idle tasks to be "overloaded". TBH I do
prefer this than ">=1" for the wakeup filtering anyway, because if
there are at least two tasks, that makes it less likely for us to race
with seeing a spurious wakeup/sleep causing a cpu to be fully
idle/non-idle (ie. we have more confidence that we can safely filter
out the overload mask).

Agreed.


Why are these cpu mask writes not atomic?

They are atomic. The non-atomic version is __cpumask_{set,clear}_cpu.
Did I miss something?

Ah, I confused these, my bad.

I'd caution about using task_tick and pick_next_task_fair as the
places we set and clear overload.

Some issues with task_tick:
- ticks may be disabled in NO_HZ_FULL (an issue if we define overload
as > 0 non-idle tasks)
- most ticks will have the same state, so somewhat redundant checking.
Could use an edge based trigger instead, such as enqueue/dequeue
(somewhat similar to rq->rd->overload).

1. In NO_HZ_FULL, given rq is overloaded, say have non-idle task A and
B enqueued, if A is dequeued before next tick then tick will be off
and the rq will keep "overloaded" while it's actually not. But this
doesn't necessarily be a bad thing because this cpu will be skipped
in wakeup path which helps in improving searching efficiency.

Yea this concern is alleviated because overload is actually >=2 tasks
(I had been incorrectly assuming you wanted to mark overload for >=1
non-idle task.

2. Yes, that's why I use rq->overloaded to save the last update state.
So when the overloaded state doesn't change, what we all do is a
simple check on a local variable.
The enqueue/dequeue path is not well bounded, and it could be very
frequent on short running workloads, which would introduce great
overhead to update the LLC shared atomic/cpumask.

Yea, the frequent update would be an issue. I now see the check on the
cpu-local variable.

So the rate limit on updates comes from the fact that
!overloaded->overloaded requires a tick. We can quickly go from
overloaded->!overloaded, but will take another tick until we can go
back to overloaded.

From the SIS's point of view, yes. But sched_idle_balance() will help
in overloaded->!overloaded transition, since an update will be issued
on the overloaded cpu after pulling. This tendency towards !overload
makes overload bits more reliable than the !overload bits in cpumask,
and when comes to the worst case, that is all bits are !overload, it
just fallback to the original SIS code.


With pick_next_task_fair():
- there's a window between a thread dequeuing, and then scheduler
running through to the end of pick_next_task_fair(), during which we
falsely observe the cpu as overloaded
- this breaks with core scheduling, since we may use pick_task_fair
rather than pick_next_task_fair

1. I'm afraid I don't understand what the problem is, can you explain
more on this? Thanks.

Can ignore this comment, I don't think it is relevant given that this
isn't really a regression vs. the latency between the last thread
dequeuing and available_idle_cpu() returning true.

2. Nice catch, I will fix it in next update. (Maybe by updating the
overloaded status in do_idle()?)

Ideally can catch it before we actually switch to rq->idle (just
trying to minimize latency to mark as !overloaded).

Makes sense.

Thanks & BR,
Abel