Re: [RFC v5 2/9] sched/deadline: improve the tracking of active utilization

From: Luca Abeni
Date: Mon Jul 24 2017 - 04:06:31 EST


Hi Peter,

On Fri, 24 Mar 2017 22:47:15 +0100
luca abeni <luca.abeni@xxxxxxxxxxxxxxx> wrote:

> Hi Peter,
>
> On Fri, 24 Mar 2017 14:20:41 +0100
> Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> > On Fri, Mar 24, 2017 at 04:52:55AM +0100, luca abeni wrote:
> >
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index d67eee8..952cac8 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -445,16 +445,33 @@ struct sched_dl_entity {
> > > *
> > > * @dl_yielded tells if task gave up the CPU before
> > > consuming
> > > * all its available runtime during the last job.
> > > + *
> > > + * @dl_non_contending tells if task is inactive while still
> > > + * contributing to the active utilization. In other words,
> > > it
> > > + * indicates if the inactive timer has been armed and its
> > > handler
> > > + * has not been executed yet. This flag is useful to avoid
> > > race
> > > + * conditions between the inactive timer handler and the
> > > wakeup
> > > + * code.
> > > */
> > > int dl_throttled;
> > > int dl_boosted;
> > > int dl_yielded;
> > > + int dl_non_contending;
> >
> > We should maybe look into making those bits :1, not something for this
> > patch though;
>
> Yes, grouping all the flags in a single field was my intention too... I
> planned to submit a patch to do this after merging the reclaiming
> patches... But maybe it is better to do this first :)

I implemented this change, but before submitting the patch I have a
small question.
I implemented some helpers to access the various
{throttled,boosted,yielded,non_contending} flags. I have some
"dl_{throttled,boosted,...}()" inline functions for reading the values
of the flags, and some inline functions for setting / clearing the
flags. For these, I have two possibilities:
- using two separate "dl_set_{throttled,...}()" and
"dl_clear_{throttled,..}()" functions for each flag
- using one single "dl_set_{throttled,...}(dl, value)" function per
flag, in which the flag's value is specified.

I have no preferences (with the first proposal, I introduce more inline
functions, but I think the functions can be made more efficient /
optimized). Which one of the two proposals is preferred? (or, there is
a third, better, idea that I overlooked?)


Thanks,
Luca

>
>
> > > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > > index cef9adb..86aed82 100644
> > > --- a/kernel/sched/deadline.c
> > > +++ b/kernel/sched/deadline.c
> > > @@ -65,6 +65,107 @@ void sub_running_bw(u64 dl_bw, struct dl_rq
> > > *dl_rq) dl_rq->running_bw = 0;
> > > }
> > >
> > > +void dl_change_utilization(struct task_struct *p, u64 new_bw)
> > > +{
> > > + if (!task_on_rq_queued(p)) {
> > > + struct rq *rq = task_rq(p);
> > > +
> > > + if (p->dl.dl_non_contending) {
> > > + sub_running_bw(p->dl.dl_bw, &rq->dl);
> > > + p->dl.dl_non_contending = 0;
> > > + /*
> > > + * If the timer handler is currently
> > > running and the
> > > + * timer cannot be cancelled,
> > > inactive_task_timer()
> > > + * will see that dl_not_contending is not
> > > set, and
> > > + * will not touch the rq's active
> > > utilization,
> > > + * so we are still safe.
> > > + */
> > > + if
> > > (hrtimer_try_to_cancel(&p->dl.inactive_timer) == 1)
> > > + put_task_struct(p);
> > > + }
> > > + }
> > > +}
> >
> > If we rearrange that slightly we can avoid the double indent:
> [...]
> Ok; I'll rework the code in this way on Monday.
>
> [...]
> > > + if (hrtimer_try_to_cancel(&dl_se->inactive_timer)
> > > == 1)
> > > + put_task_struct(dl_task_of(dl_se));
> > > + dl_se->dl_non_contending = 0;
> >
> > This had me worried for a little while as being a use-after-free, but
> > this put_task_struct() cannot be the last in this case. Might be
> > worth a comment.
>
> Or maybe it is better to move "dl_se->dl_non_contending = 0;" before
> hrtimer_try_to_cancel()?
>
> >
> > > + } else {
> > > + /*
> > > + * Since "dl_non_contending" is not set, the
> > > + * task's utilization has already been removed from
> > > + * active utilization (either when the task
> > > blocked,
> > > + * when the "inactive timer" fired).
> > > + * So, add it back.
> > > + */
> > > + add_running_bw(dl_se->dl_bw, dl_rq);
> > > + }
> > > +}
> >
> > In general I feel it would be nice to have a state diagram included
> > somewhere near these two functions. It would be nice to not have to
> > dig out the PDF every time.
>
> Ok... Since I am not good at ascii art, would it be ok to add a textual
> description? If yes, I'll add a comment like:
> "
> The utilization of a task is added to the runqueue's active utilization
> when the task becomes active (is enqueued in the runqueue), and is
> removed when the task becomes inactive. A task does not become
> immediately inactive when it blocks, but becomes inactive at the so
> called "0 lag time"; so, we setup the "inactive timer" to fire at the
> "0 lag time". When the "inactive timer" fires, the task utilization is
> removed from the runqueue's active utilization. If the task wakes up
> again on the same runqueue before the "0 lag time", the active
> utilization must not be changed and the "inactive timer" must be
> cancelled. If the task wakes up again on a different runqueue before
> the "0 lag time", then the task's utilization must be removed from the
> previous runqueue's active utilization and must be added to the new
> runqueue's active utilization.
> In order to avoid races between a task waking up on a runqueue while the
> "inactive timer" is running on a different CPU, the "dl_non_contending"
> flag is used to indicate that a task is not on a runqueue but is active
> (so, the flag is set when the task blocks and is cleared when the
> "inactive timer" fires or when the task wakes up).
> "
> (if this is ok, where can I add this comment?)
>
>
> > > +static void migrate_task_rq_dl(struct task_struct *p)
> > > +{
> > > + if ((p->state == TASK_WAKING) &&
> > > (p->dl.dl_non_contending)) {
> > > + struct rq *rq = task_rq(p);
> > > +
> > > + raw_spin_lock(&rq->lock);
> > > + sub_running_bw(p->dl.dl_bw, &rq->dl);
> > > + p->dl.dl_non_contending = 0;
> > > + /*
> > > + * If the timer handler is currently running and
> > > the
> > > + * timer cannot be cancelled, inactive_task_timer()
> > > + * will see that dl_not_contending is not set, and
> > > + * will not touch the rq's active utilization,
> > > + * so we are still safe.
> > > + */
> > > + if (hrtimer_try_to_cancel(&p->dl.inactive_timer)
> > > == 1)
> > > + put_task_struct(p);
> > > +
> > > + raw_spin_unlock(&rq->lock);
> > > + }
> > > +}
> >
> > This can similarly be reduced in indent, albeit this is only a single
> > indent level, so not as onerous as the other one.
>
> Ok; I'll do this on Monday.
>
>
> > What had me puzzled for a while here is taking the lock; because some
> > callers of set_task_cpu() must in fact hold this lock already. Worth a
> > comment I feel.
>
> Ok; I'll add a comment mentioning that since state is TASK_WAKING we do
> not have the lock.
>
>
> > Once I figured out the exact locking; I realized you do this on
> > cross-cpu wakeups. We spend a fair amount of effort not to take both
> > rq locks. But I suppose you have to here.
>
> The problem is that when a task wakes up before the "0 lag time" on a
> different runqueue, we must "migrate" its utilization from the old
> runqueue to the new one (see comment above). And I need the lock to
> avoid races... The only alternative I can think about is to ad a new
> lock protecting the active utilization, and to take it instead of the
> runqueue lock (I do not know if this is enough, I need to check...).
> I'll have a look on Monday.
>
>
>
> Thanks,
> Luca