Re: [RFC 5/6] sched/fair: Take into account latency nice at wakeup

From: Vincent Guittot
Date: Mon May 02 2022 - 08:31:13 EST


On Mon, 2 May 2022 at 11:54, Vincent Guittot <vincent.guittot@xxxxxxxxxx> wrote:
>
> Hi Tao,
>
> On Sun, 1 May 2022 at 17:58, Tao Zhou <tao.zhou@xxxxxxxxx> wrote:
> >
> > Hi Vincent,
> >
> > Change to Valentin Schneider's now using mail address.
>
> Thanks
>
> >
> > On Fri, Mar 11, 2022 at 05:14:05PM +0100, Vincent Guittot wrote:
> >
> > > Take into account the nice latency priority of a thread when deciding to
> > > preempt the current running thread. We don't want to provide more CPU
> > > bandwidth to a thread but reorder the scheduling to run latency sensitive
> > > task first whenever possible.
> > >
> > > As long as a thread didn't use its bandwidth, it will be able to preempt
> > > the current thread.
> > >
> > > At the opposite, a thread with a low latency priority will preempt current
> > > thread at wakeup only to keep fair CPU bandwidth sharing. Otherwise it will
> > > wait for the tick to get its sched slice.
> > >
> > > curr vruntime
> > > |
> > > sysctl_sched_wakeup_granularity
> > > <-->
> > > ----------------------------------|----|-----------------------|---------------
> > > | |<--------------------->
> > > | . sysctl_sched_latency
> > > | .
> > > default/current latency entity | .
> > > | .
> > > 1111111111111111111111111111111111|0000|-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-
> > > se preempts curr at wakeup ------>|<- se doesn't preempt curr -----------------
> > > | .
> > > | .
> > > | .
> > > low latency entity | .
> > > ---------------------->|
> > > % of sysctl_sched_latency |
> > > 1111111111111111111111111111111111111111111111111111111111|0000|-1-1-1-1-1-1-1-
> > > preempt ------------------------------------------------->|<- do not preempt --
> > > | .
> > > | .
> > > | .
> > > high latency entity | .
> > > |<-----------------------| .
> > > | % of sysctl_sched_latency .
> > > 111111111|0000|-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1
> > > preempt->|<- se doesn't preempt curr ------------------------------------------
> > >
> > > Tests results of nice latency impact on heavy load like hackbench:
> > >
> > > hackbench -l (2560 / group) -g group
> > > group latency 0 latency 19
> > > 1 1.450(+/- 0.60%) 1.376(+/- 0.54%) + 5%
> > > 4 1.537(+/- 1.75%) 1.335(+/- 1.81%) +13%
> > > 8 1.414(+/- 1.91%) 1.348(+/- 1.88%) + 5%
> > > 16 1.423(+/- 1.65%) 1.374(+/- 0.58%) + 3%
> > >
> > > hackbench -p -l (2560 / group) -g group
> > > group
> > > 1 1.416(+/- 3.45%) 0.886(+/- 0.54%) +37%
> > > 4 1.634(+/- 7.14%) 0.888(+/- 5.40%) +45%
> > > 8 1.449(+/- 2.14%) 0.804(+/- 4.55%) +44%
> > > 16 0.917(+/- 4.12%) 0.777(+/- 1.41%) +15%
> > >
> > > By deacreasing the latency prio, we reduce the number of preemption at
> >
> > s/deacreasing/decreasing/
>
> yes
>
> > s/reduce/increase/
>
> not in the case of hackbench tests above. By decreasing the latency
> prio of all hackbench threads, we make sure that they will not preempt
> the current thread and let it move forward so we reduce the number of
> preemption.
>
> >
> > > wakeup and help hackbench making progress.
> > >
> > > Test results of nice latency impact on short live load like cyclictest
> > > while competing with heavy load like hackbench:
> > >
> > > hackbench -l 10000 -g group &
> > > cyclictest --policy other -D 5 -q -n
> > > latency 0 latency -20
> > > group min avg max min avg max
> > > 0 16 17 28 15 17 27
> > > 1 61 382 10603 63 89 4628
> > > 4 52 437 15455 54 98 16238
> > > 8 56 728 38499 61 125 28983
> > > 16 53 1215 52207 61 183 80751
> > >
> > > group = 0 means that hackbench is not running.
> > >
> > > The avg is significantly improved with nice latency -20 especially with
> > > large number of groups but min and max remain quite similar. If we add the
> > > histogram parameters to get details of latency, we have :
> > >
> > > hackbench -l 10000 -g 16 &
> > > cyclictest --policy other -D 5 -q -n -H 20000 --histfile data.txt
> > > latency 0 latency -20
> > > Min Latencies: 63 62
> > > Avg Latencies: 1397 218
> > > Max Latencies: 44926 42291
> > > 50% latencies: 100 98
> > > 75% latencies: 762 116
> > > 85% latencies: 1118 126
> > > 90% latencies: 1540 130
> > > 95% latencies: 5610 138
> > > 99% latencies: 13738 266
> > >
> > > With percentile details, we see the benefit of nice latency -20 as
> > > 1% of the latencies stays above 266us whereas the default latency has
> > > got 25% are above 762us and 10% over the 1ms.
> > >
>
> [..]
>
> > > +static long wakeup_latency_gran(int latency_weight)
> > > +{
> > > + long thresh = sysctl_sched_latency;
> > > +
> > > + if (!latency_weight)
> > > + return 0;
> > > +
> > > + if (sched_feat(GENTLE_FAIR_SLEEPERS))
> > > + thresh >>= 1;
> > > +
> > > + /*
> > > + * Clamp the delta to stay in the scheduler period range
> > > + * [-sysctl_sched_latency:sysctl_sched_latency]
> > > + */
> > > + latency_weight = clamp_t(long, latency_weight,
> > > + -1 * NICE_LATENCY_WEIGHT_MAX,
> > > + NICE_LATENCY_WEIGHT_MAX);
> > > +
> > > + return (thresh * latency_weight) >> NICE_LATENCY_SHIFT;
> > > +}
> > > +
> > > static unsigned long wakeup_gran(struct sched_entity *se)
> > > {
> > > unsigned long gran = sysctl_sched_wakeup_granularity;
> > > @@ -7008,6 +7059,10 @@ static int
> > > wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
> > > {
> > > s64 gran, vdiff = curr->vruntime - se->vruntime;
> > > + int latency_weight = se->latency_weight - curr->latency_weight;
> > > +
> > > + latency_weight = min(latency_weight, se->latency_weight);
> >
> > Let lable A=se->latency_weight, B=curr->latency_weight, C=A-B.
> >
> > 1 A>0 B>0
> > ::C=A-B<0, min(C,A)=C, latency decrease, C is the real diff value no matter
> > what A is. That means it can be very 'long'(-sched_latency) and vdiff is
> > more possible to be in <= 0 case and return -1.
> > ::C=A-B>0, min(C,A)=A, latency increase, but it is conservative. Limit to
>
> A>0 and B>0 then min(C=A-B, A) = C
>
> > A/1024*sched_latency.
> > When latecy is decreased, the negtive value added to vdiff is larger than the
> > positive value added to vdiff when latency is increased.
>
> When the weight > 0, the tasks have some latency requirements so we
> use their relative priority to decide if we can preempt current which
> also has some latency requirement
>
> >
> > 2 A>0 B<0
> > ::C=A-B>0 and C>A, min(C,A)=A, latency increase and it is conservative.

For this one we want to use delta like for UC 1 above

> >
> > 3 A<0 B<0
> > ::C=A-B>0, min(C,A)=A, latency increase but min(C,A)<0, I think if latency
> > increase means the value added to vdiff will be a positive value to increase
> > the chance to return 1. I would say it is max(C,A)=C
> > ::C=A-B<0, min(C,A)=A, latency decrease and the real negtive value.
>
> When the weight < 0, the tasks haven't latency requirement and even
> don't care of being scheduled "quickly". In this case, we don't care
> about the relative priority but we want to minimize the preemption of
> current so we use the weight
>
> >
> > 4 A<0,B>0
> > ::C=A-B<0, min(C,A)=C, latency decrease and the real negtive value.

And for this one we probably want to use A like for other UC with A < 0

I'm going to update the way the weight is computed to match this

> >
> > Is there a reason that the value when latency increase and latency decrease
> > be treated differently. Latency increase value is limited to se's latency_weight
>
> I have tried to explain why I treat differently if weight is > 0 or < 0.
>
> > but latency decrease value can extend to -sched_latency or treat them the same.
> > Penalty latency decrease and conserve latency increase.
> >
> >
> > There is any value that this latency_weight can be considered to be a average.
> >
> > The delta value choose is ~%5 to 1024. %5*sched_latency=0.05*6ms=0.3ms.(no scale)
> > I do not think over that vdiff equation and others though.
> >
> > Thanks,
> > Tao
> > > + vdiff += wakeup_latency_gran(latency_weight);
> > >
> > > if (vdiff <= 0)
> > > return -1;
> > > @@ -7117,6 +7172,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> > > return;
> > >
> > > update_curr(cfs_rq_of(se));
> > > +
> > > if (wakeup_preempt_entity(se, pse) == 1) {
> > > /*
> > > * Bias pick_next to pick the sched entity that is
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > index 456ad2159eb1..dd92aa9c36f9 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -122,6 +122,17 @@ extern void call_trace_sched_update_nr_running(struct rq *rq, int count);
> > > * Default tasks should be treated as a task with latency_nice = 0.
> > > */
> > > #define DEFAULT_LATENCY_NICE 0
> > > +#define DEFAULT_LATENCY_PRIO (DEFAULT_LATENCY_NICE + LATENCY_NICE_WIDTH/2)
> > > +
> > > +/*
> > > + * Convert user-nice values [ -20 ... 0 ... 19 ]
> > > + * to static latency [ 0..39 ],
> > > + * and back.
> > > + */
> > > +#define NICE_TO_LATENCY(nice) ((nice) + DEFAULT_LATENCY_PRIO)
> > > +#define LATENCY_TO_NICE(prio) ((prio) - DEFAULT_LATENCY_PRIO)
> > > +#define NICE_LATENCY_SHIFT (SCHED_FIXEDPOINT_SHIFT)
> > > +#define NICE_LATENCY_WEIGHT_MAX (1L << NICE_LATENCY_SHIFT)
> > >
> > > /*
> > > * Increase resolution of nice-level calculations for 64-bit architectures.
> > > @@ -2098,6 +2109,7 @@ static_assert(WF_TTWU == SD_BALANCE_WAKE);
> > >
> > > extern const int sched_prio_to_weight[40];
> > > extern const u32 sched_prio_to_wmult[40];
> > > +extern const int sched_latency_to_weight[40];
> > >
> > > /*
> > > * {de,en}queue flags:
> > > --
> > > 2.17.1
> > >