Re: [bug-report] possible s64 overflow in max_vruntime()

From: Peter Zijlstra
Date: Thu Jan 26 2023 - 07:50:15 EST


On Wed, Jan 25, 2023 at 08:45:32PM +0100, Roman Kagan wrote:

> The calculation is indeed safe against the overflow of the vruntimes
> themselves. However, when the two vruntimes are more than 2^63 apart,
> their comparison gets inverted due to that s64 overflow.

Yes, but that's a whole different issue. vruntimes are not expected to be
*that* far apart.

That is surely the abnormal case. The normal case is wrap around, and
that happens 'often' and should continue working.
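
To make the distinction concrete, here is a minimal user-space sketch (not
kernel code) of the signed-delta comparison that max_vruntime() relies on.
The helper names vruntime_before(), max_vruntime_sketch() and
vruntime_wrap_demo() are made up for this example, and it assumes the usual
two's complement conversion from uint64_t to int64_t.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if a is "before" b, judged by the sign of the 64-bit difference. */
static int vruntime_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/* Pick the later of two vruntimes using the signed delta. */
static uint64_t max_vruntime_sketch(uint64_t max_v, uint64_t v)
{
	int64_t delta = (int64_t)(v - max_v);

	if (delta > 0)
		max_v = v;
	return max_v;
}

/* Hypothetical demo: wrap-around works, a gap of more than 2^63 does not. */
static void vruntime_wrap_demo(void)
{
	/* Normal case: the counter wraps past zero, the ordering survives. */
	uint64_t near_wrap = UINT64_MAX - 10;
	uint64_t wrapped = near_wrap + 100;	/* wraps around to 89 */

	assert(vruntime_before(near_wrap, wrapped));
	assert(max_vruntime_sketch(near_wrap, wrapped) == wrapped);

	/* Abnormal case: 'ahead' is 2^63 + 1 in front of 'behind'. */
	uint64_t behind = 0;
	uint64_t ahead = (1ULL << 63) + 1;

	/* The comparison is inverted: 'ahead' now looks like the earlier one. */
	assert(vruntime_before(ahead, behind));
	assert(max_vruntime_sketch(ahead, behind) == behind);
}
```

The first pair shows why wrap-around is harmless as long as the two values
stay within 2^63 of each other; the second pair is the inversion described
in the report.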

> And this is what happens here: one scheduling entity has accumulated a
> vruntime more than 2^63 ahead of another. Now the comparison is
> inverted due to s64 overflow, and the latter can't get to the cpu,
> because it appears to have vruntime (much) bigger than that of the
> former.

If it can be 2^63 ahead, it can also be 2^(64+) ahead and nothing will
help.

> This situation is reproducible e.g. when one scheduling entity is a
> multi-cpu hog, and the other is woken up from a long sleep. Normally

A very low weight CPU hog?

> when a task is placed on a cfs_rq, its vruntime is pulled to
> min_vruntime, to avoid boosting the woken up task. However in this case
> the task is so much behind in vruntime that it appears ahead instead,
> its vruntime is not adjusted in place_entity(), and then it loses the
> cpu to the current scheduling entity.

What I think might be a way out here is passing the sleep wall-time
(cfs_rq_clock_pelt() time I suppose) to place_entity() and simply skipping
the magic if it is 'big'.

All that only matters for small sleeps anyway.

Something like:

	u64 sleep_time = U64_MAX;

	if (se->avg.last_update_time)
		sleep_time = cfs_rq_clock_pelt(cfs_rq) - se->avg.last_update_time;

	if (sleep_time > 60ULL * NSEC_PER_SEC) { // 1 minute is huge
		se->vruntime = cfs_rq->min_vruntime;
		return;
	}

	// ... rest of place_entity()

Hmm... ?
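
For anyone who wants to poke at the idea outside the kernel, the check above
can be modelled in user space roughly as follows. This is only a sketch under
stated assumptions: struct mock_se, struct mock_cfs_rq,
place_entity_long_sleep() and long_sleep_demo() are hypothetical stand-ins,
the clock_pelt field takes the place of cfs_rq_clock_pelt(), and the 60 second
threshold is the same guess as above.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC	1000000000ULL
#define LONG_SLEEP_NS	(60ULL * NSEC_PER_SEC)	/* "1 minute is huge" */

/* Hypothetical stand-ins for the relevant sched_entity / cfs_rq fields. */
struct mock_se {
	uint64_t vruntime;
	uint64_t last_update_time;	/* 0 means "never updated" */
};

struct mock_cfs_rq {
	uint64_t min_vruntime;
	uint64_t clock_pelt;		/* stands in for cfs_rq_clock_pelt() */
};

/*
 * Returns 1 if the entity slept long enough to be reset to min_vruntime,
 * 0 if the caller should carry on with the normal placement logic.
 */
static int place_entity_long_sleep(struct mock_cfs_rq *rq, struct mock_se *se)
{
	uint64_t sleep_time = UINT64_MAX;	/* no timestamp: treat as huge */

	if (se->last_update_time)
		sleep_time = rq->clock_pelt - se->last_update_time;

	if (sleep_time > LONG_SLEEP_NS) {
		se->vruntime = rq->min_vruntime;
		return 1;
	}
	return 0;
}

/* Hypothetical demo: a 90 second sleeper is reset, a 1 second sleeper is not. */
static void long_sleep_demo(void)
{
	struct mock_cfs_rq rq = {
		.min_vruntime = 1000,
		.clock_pelt = 100 * NSEC_PER_SEC,
	};
	struct mock_se slept_long = {
		.vruntime = 5,
		.last_update_time = 10 * NSEC_PER_SEC,
	};
	struct mock_se slept_short = {
		.vruntime = 990,
		.last_update_time = 99 * NSEC_PER_SEC,
	};

	assert(place_entity_long_sleep(&rq, &slept_long) == 1);
	assert(slept_long.vruntime == 1000);
	assert(place_entity_long_sleep(&rq, &slept_short) == 0);
	assert(slept_short.vruntime == 990);
}
```

The point of keying on wall-clock sleep time is that the decision no longer
depends on comparing two vruntimes, so it cannot be fooled by the inverted
comparison.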