Re: [PATCH v5 2/2] sched/fair: Introduce SIS_SHORT to wake up short task on current CPU

From: Chen Yu
Date: Sun Feb 19 2023 - 23:58:52 EST


On 2023-02-17 at 16:35:24 +0800, Honglei Wang wrote:
>
>
> On 2023/2/16 20:55, Abel Wu wrote:
> > Hi Chen,
> >
> > I've tested this patchset (with modification) on our Redis proxy
> > servers, and the results seems promising.
> >
> > On 2/3/23 1:18 PM, Chen Yu wrote:
> > > ...
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index aa16611c7263..d50097e5fcc1 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6489,6 +6489,20 @@ static int wake_wide(struct task_struct *p)
> > >       return 1;
> > >   }
> > > +/*
> > > + * If a task switches in and then voluntarily relinquishes the
> > > + * CPU quickly, it is regarded as a short duration task.
> > > + *
> > > + * SIS_SHORT tries to wake up the short wakee on current CPU. This
> > > + * aims to avoid race condition among CPUs due to frequent context
> > > + * switch.
> > > + */
> > > +static inline int is_short_task(struct task_struct *p)
> > > +{
> > > +    return sched_feat(SIS_SHORT) && p->se.dur_avg &&
> > > +           ((p->se.dur_avg * 8) < sysctl_sched_min_granularity);
> > > +}
> >
> > I changed the factor to fit into the shape of tasks in question.
> >
> >     static inline int is_short_task(struct task_struct *p)
> >     {
> >         u64 dur = sysctl_sched_min_granularity / 8;
> >
> >         if (!sched_feat(SIS_SHORT) || !p->se.dur_avg)
> >             return false;
> >
> >         /*
> >          * Bare tracepoint to allow dynamically changing
> >          * the threshold.
> >          */
> >         trace_sched_short_task_tp(p, &dur);
> >
> >         return p->se.dur_avg < dur;
> >     }
> >
> > I'm not sure it is the right way to provide such flexibility, but
> > definition of 'short' can be workload specific.
> >
> > > +
> > >   /*
> > >    * The purpose of wake_affine() is to quickly determine on which
> > > CPU we can run
> > >    * soonest. For the purpose of speed we only consider the waking
> > > and previous
> > > @@ -6525,6 +6539,11 @@ wake_affine_idle(int this_cpu, int prev_cpu,
> > > int sync)
> > >       if (available_idle_cpu(prev_cpu))
> > >           return prev_cpu;
> > > +    /* The only running task is a short duration one. */
> > > +    if (cpu_rq(this_cpu)->nr_running == 1 &&
> > > +        is_short_task(rcu_dereference(cpu_curr(this_cpu))))
> > > +        return this_cpu;
> >
> > Since proxy server handles simple data delivery, the tasks are
> > generally short running ones and hate task stacking which may
> > introduce scheduling latency (even there are only 2 short tasks
> > competing each other). So this part brings slight regression on
> > the proxy case. But I still think this is good for most cases.
> >
> > Speaking of task stacking, I found wake_affine_weight() can be
> > much more dangerous. It chooses the less loaded one between the
> > prev & this cpu as a candidate, so 'small' tasks can be easily
> > stacked on this cpu when wake up several tasks at one time if
> > this cpu is unloaded. This really hurts if the 'small' tasks are
> > latency-sensitive, although wake_affine_weight() does the right
> > thing from the point of view of 'load'.
> >
> > The following change greatly reduced the p99lat of Redis service
> > from 150ms to 0.9ms, at exactly the same throughput (QPS).
> >
> > @@ -5763,6 +5787,9 @@ wake_affine_weight(struct sched_domain *sd, struct
> > task_struct *p,
> >     s64 this_eff_load, prev_eff_load;
> >     unsigned long task_load;
> >
> > +    if (is_short_task(p))
> > +        return nr_cpumask_bits;
> > +
> >     this_eff_load = cpu_load(cpu_rq(this_cpu));
> >
> >     if (sync) {
> >
> > I know that 'short' tasks are not necessarily 'small' tasks, e.g.
> > sleeping duration is small or have large weights, but this works
> > really well for this case. This is partly because delivering data
> > is memory bandwidth intensive hence prefer cache hot cpus. And I
> > think this is also applicable to the general purposes: do NOT let
> > the short running tasks suffering from cache misses caused by
> > migration.
> >
>
> Redis is a bit special. It runs quick and really sensitive on schedule
> latency. The purpose of this 'short task' feature from Yu is to mitigate the
> migration and tend to place the waking task on local cpu, this is somehow on
> the opposite side of workload such as Redis. The changes you did remind me
> of the latency-prio stuff. Maybe we can do something base on both the 'short
> task' and 'latency-prio' to make your changes more general. thoughts?
>
Looks reasonable, I suppose you were refering to 'latency nice' proposed by
Vincent. For now I'd like to keep this patch simple enough, later we can
extend it.

thanks,
Chenyu
> Thanks,
> Honglei
>
> > Best regards,
> >     Abel