Re: [PATCH v10 8/9] sched/fair: Add latency list

From: Vincent Guittot
Date: Mon Feb 27 2023 - 09:55:52 EST


On Mon, 27 Feb 2023 at 14:29, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Wed, Feb 22, 2023 at 12:16:29PM +0100, Vincent Guittot wrote:
> > On Wed, 22 Feb 2023 at 10:50, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > >
> > > On Fri, Jan 13, 2023 at 03:12:33PM +0100, Vincent Guittot wrote:
> > >
> > > > +static void __enqueue_latency(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > > > +{
> > > > +
> > > > +	/* Only latency sensitive entity can be added to the list */
> > > > +	if (se->latency_offset >= 0)
> > > > +		return;
> > > > +
> > > > +	if (!RB_EMPTY_NODE(&se->latency_node))
> > > > +		return;
> > > > +
> > > > +	/*
> > > > +	 * An execution time less than sysctl_sched_min_granularity means that
> > > > +	 * the entity has been preempted by a higher sched class or an entity
> > > > +	 * with higher latency constraint.
> > > > +	 * Put it back in the list so it gets a chance to run 1st during the
> > > > +	 * next slice.
> > > > +	 */
> > > > +	if (!(flags & ENQUEUE_WAKEUP)) {
> > > > +		u64 delta_exec = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> > > > +
> > > > +		if (delta_exec >= sysctl_sched_min_granularity)
> > > > +			return;
> > > > +	}
> > >
> > > I'm not a big fan of this dynamic enqueueing condition; it makes it
> > > rather hard to interpret the below addition to pick_next_entity().
> > >
> > > Let me think about this more... at the very least the comment with
> > > __pick_first_latency() use below needs to be expanded upon if we keep it
> > > like so.
> >
> > Only the waking tasks should be added to the latency rb tree so they
>
> But that's what I'm saying, you can game this by doing super short
> sleeps every min_gran.

Yes, that works for a while. But I'm not sure it will always be
beneficial in the end, because most of the time you would get your full
slice anyway without playing this game.
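
For illustration, here is a hypothetical userspace sketch of the game
Peter describes (not from the patch; the 2.5 ms and 10 us values are
arbitrary assumptions standing in for "a bit less than
sysctl_sched_min_granularity" and "a super short sleep"):

/*
 * Hypothetical illustration: run for a bit less than min_granularity,
 * then sleep very briefly so the next enqueue is a wakeup and the task
 * is put back in the latency rb tree, keeping its preemption bonus.
 */
#include <stdint.h>
#include <time.h>

static int64_t elapsed_ns(const struct timespec *a, const struct timespec *b)
{
	return (int64_t)(b->tv_sec - a->tv_sec) * 1000000000 +
	       (b->tv_nsec - a->tv_nsec);
}

static void burn_ns(int64_t ns)
{
	struct timespec start, now;

	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
	do {
		clock_gettime(CLOCK_THREAD_CPUTIME_ID, &now);
	} while (elapsed_ns(&start, &now) < ns);
}

int main(void)
{
	const struct timespec nap = { .tv_sec = 0, .tv_nsec = 10000 }; /* 10 us */

	for (;;) {
		burn_ns(2500000);	/* ~2.5 ms, below an assumed 3 ms min_granularity */
		nanosleep(&nap, NULL);	/* super short sleep -> next enqueue is a wakeup */
	}
	return 0;
}

Whether this actually pays off is exactly the open question above: the
task gives up the rest of its slice each time it naps, so most of the
time it would have got its full slice anyway.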

>
> > can be selected to run 1st (as long as they don't use too much
> > runtime). But, as an example, task A can wake up, preempt the current
> > task B thanks to its latency nice, start to run for a few usecs, but
> > then be immediately preempted by an RT task C. In this case, we
> > consider that task A didn't get a chance to run after its wakeup and
> > we put it back in the latency rb tree, just as if task A had just
> > woken up but didn't preempt the new current task C.
>
> So ideally, and this is where I'm very slow with thinking, that
> wakeup_preempt_entity() condition here:
>
> > > > @@ -5008,6 +5082,12 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > > >  		se = cfs_rq->last;
> > > >  	}
> > > >
> > > > +	/* Check for latency sensitive entity waiting for running */
> > > > +	latency = __pick_first_latency(cfs_rq);
> > > > +	if (latency && (latency != se) &&
> > > > +	    wakeup_preempt_entity(latency, se) < 1)
> > > > +		se = latency;
>
> should be sufficient to provide fair bandwidth usage. The EEVDF paper
> achieves this by selecting the leftmost eligible task, where eligibility
> depends on negative lag. Only those tasks that are behind the pack
> are allowed runtime.
>
> Now clearly our min_vruntime is unsuited for that exact scheme, but IIRC
> wakeup_preempt_entity() does not allow for starvation, so we should be
> good, even without that weird condition in __enqueue_latency(), hmm?
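
For context, the starvation argument relies on how
wakeup_preempt_entity() bounds the advantage it grants. A simplified
reconstruction of the mainline helper, from memory and ignoring any
latency-offset adjustments made elsewhere in this series, looks roughly
like:

/*
 * Should 'se' preempt 'curr'?  Returns:
 *  -1: 'se' has at least as much vruntime as 'curr', don't preempt
 *   0: 'se' is behind 'curr', but within the wakeup granularity
 *   1: 'se' is behind 'curr' by more than the wakeup granularity
 */
static int wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
	s64 gran, vdiff = curr->vruntime - se->vruntime;

	if (vdiff <= 0)
		return -1;

	gran = wakeup_gran(se);
	if (vdiff > gran)
		return 1;

	return 0;
}

In the pick_next_entity() hunk above, "< 1" therefore means the latency
candidate is only taken while its vruntime has not run ahead of the
regular pick by more than wakeup_gran(), which is what bounds the extra
bandwidth it can grab.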

If we unconditionally __enqueue_latency() the task, then it can end up
getting more bandwidth than the other tasks, because it's like having a
larger sleep credit than the others.

The original condition in __enqueue_latency() was:

	if (!(flags & ENQUEUE_WAKEUP)) {
		return;
	}

So the task gets a chance to preempt others only at wakeup.
But then I have seen such tasks being preempted immediately by RT
tasks and, as a result, losing their latency advantage. Maybe I should
keep the condition above and add the weird condition in a separate
patch with dedicated figures.
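
To make the alternative concrete, a sketch of the simpler wakeup-only
variant being discussed; the rb-tree insertion at the end is omitted
since it is not shown in the quoted hunk, and this is only an
illustration, not the actual patch:

static void __enqueue_latency(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	/* Only latency sensitive entities go in the latency rb tree */
	if (se->latency_offset >= 0)
		return;

	/* Already queued */
	if (!RB_EMPTY_NODE(&se->latency_node))
		return;

	/*
	 * Only give the wakeup preemption bonus on an actual wakeup; the
	 * "preempted before min_granularity" exception would then go in a
	 * separate patch with its own numbers.
	 */
	if (!(flags & ENQUEUE_WAKEUP))
		return;

	/* rb-tree insertion as in the original patch (omitted here) */
}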