Re: [bug-report] possible s64 overflow in max_vruntime()

From: Roman Kagan
Date: Thu Jan 26 2023 - 13:31:29 EST


On Thu, Jan 26, 2023 at 01:49:43PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 25, 2023 at 08:45:32PM +0100, Roman Kagan wrote:
>
> > The calculation is indeed safe against the overflow of the vruntimes
> > themselves. However, when the two vruntimes are more than 2^63 apart,
> > their comparison gets inverted due to that s64 overflow.
>
> Yes, but that's a whole different issue. vruntime are not expected to be
> *that* far apart.
>
> That is surely the abnormal case. The normal case is wrap around, and
> that happens 'often' and should continue working.
>
> > And this is what happens here: one scheduling entity has accumulated a
> > vruntime more than 2^63 ahead of another. Now the comparison is
> > inverted due to s64 overflow, and the latter can't get to the cpu,
> > because it appears to have vruntime (much) bigger than that of the
> > former.
>
> If it can be 2^63 ahead, it can also be 2^(64+) ahead and nothing will
> help.
>
> > This situation is reproducible e.g. when one scheduling entity is a
> > multi-cpu hog, and the other is woken up from a long sleep. Normally
>
> A very low weight CPU hog?

Right. In our case this weight was due to the task group consuming
all 448 cpus on the machine; presumably one can achieve this on a smaller
machine by tweaking shares of the cgroup.
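
For readers following along, here is a rough sketch of why a low weight makes
vruntime run ahead so quickly. It assumes the simplified textbook relation
delta_vruntime = delta_exec * 1024 / weight and ignores the kernel's
fixed-point inverse-weight arithmetic and weight scaling. The helper names
(vruntime_delta, exec_ns_to_reach_2_63) and the NICE_0_WEIGHT constant are
made up for illustration and are not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: unscaled nice-0 weight; the kernel may scale this up. */
#define NICE_0_WEIGHT 1024ULL

/*
 * Simplified model: vruntime advances faster than real execution time
 * by the factor NICE_0_WEIGHT / weight. Callers must keep delta_exec_ns
 * small enough that the multiplication does not overflow 64 bits.
 */
static uint64_t vruntime_delta(uint64_t delta_exec_ns, uint64_t weight)
{
	assert(weight > 0);
	return delta_exec_ns * NICE_0_WEIGHT / weight;
}

/*
 * Execution time (ns) after which the modelled vruntime has advanced
 * by 2^63, i.e. the distance at which an s64 comparison inverts.
 * Valid for weight <= NICE_0_WEIGHT, so the result fits in 64 bits.
 */
static uint64_t exec_ns_to_reach_2_63(uint64_t weight)
{
	assert(weight > 0 && weight <= NICE_0_WEIGHT);
	return ((1ULL << 63) / NICE_0_WEIGHT) * weight;
}
```

In this model an entity at weight 2 advances its vruntime 512 times faster than
one at nice-0 weight, so the 2^63 distance is reached 512 times sooner.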

> > when a task is placed on a cfs_rq, its vruntime is pulled to
> > min_vruntime, to avoid boosting the woken up task. However in this case
> > the task is so much behind in vruntime that it appears ahead instead,
> > its vruntime is not adjusted in place_entity(), and then it loses the
> > cpu to the current scheduling entity.
>
> What I think might be a way out here is passing the sleep wall-time
> (cfs_rq_clock_pelt() time I suppose) to place entity and simply skip the
> magic if 'big'.
>
> All that only matters for small sleeps anyway.
>
> Something like:
>
> 	sleep_time = U64_MAX;
> 	if (se->avg.last_update_time)
> 		sleep_time = cfs_rq_clock_pelt(cfs_rq) - se->avg.last_update_time;

Interesting, why not rq_clock_task(rq_of(cfs_rq)) - se->exec_start, as
others were suggesting? It appears to better match the notion of sleep
wall-time, no?
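
To make the failure mode concrete, here is a small userspace sketch of the
signed-delta comparison and of a long-sleep bypass along the lines discussed
in this thread. It is a model under stated assumptions, not the kernel code:
the *_sketch names and HUGE_SLEEP_NS are hypothetical, the placement logic is
reduced to a bare max(), and the sleep time is simply passed in, so it takes
no side on which clock should be used to compute it.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC	1000000000ULL
/* Hypothetical threshold, mirroring the "1 minute is huge" idea. */
#define HUGE_SLEEP_NS	(60 * NSEC_PER_SEC)

/*
 * Wrap-safe "a is before b": correct as long as the two values are
 * less than 2^63 apart. The unsigned-to-signed conversion relies on
 * two's-complement behaviour, as provided by gcc and clang.
 */
static bool vruntime_before_sketch(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/* Same shape as a signed-delta max(): pick the later of the two. */
static uint64_t max_vruntime_sketch(uint64_t max_vruntime, uint64_t vruntime)
{
	int64_t delta = (int64_t)(vruntime - max_vruntime);

	if (delta > 0)
		max_vruntime = vruntime;
	return max_vruntime;
}

/*
 * Simplified placement: normally never move the entity backwards,
 * but after a "huge" sleep skip the comparison and take min_vruntime.
 */
static uint64_t place_sketch(uint64_t se_vruntime, uint64_t min_vruntime,
			     uint64_t sleep_ns)
{
	if (sleep_ns > HUGE_SLEEP_NS)
		return min_vruntime;
	return max_vruntime_sketch(se_vruntime, min_vruntime);
}
```

With a sleeper at vruntime 0 and min_vruntime at 2^63 + 1000, the plain
comparison treats the sleeper as ahead and leaves it at 0; with the bypass and
a sleep longer than the threshold, it is placed at min_vruntime instead.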

Thanks,
Roman.

>
> 	if (sleep_time > 60*NSEC_PER_SEC) { // 1 minute is huge
> 		se->vruntime = cfs_rq->min_vruntime;
> 		return;
> 	}
>
> // ... rest of place_entity()
>
> Hmm... ?


