Re: [PATCH v5 05/10] cpufreq/schedutil: get max utilization

From: Vincent Guittot
Date: Thu May 31 2018 - 09:02:33 EST


On 31 May 2018 at 12:27, Patrick Bellasi <patrick.bellasi@xxxxxxx> wrote:
>
> Hi Vincent, Juri,
>
> On 28-May 18:34, Vincent Guittot wrote:
>> On 28 May 2018 at 17:22, Juri Lelli <juri.lelli@xxxxxxxxxx> wrote:
>> > On 28/05/18 16:57, Vincent Guittot wrote:
>> >> Hi Juri,
>> >>
>> >> On 28 May 2018 at 12:12, Juri Lelli <juri.lelli@xxxxxxxxxx> wrote:
>> >> > Hi Vincent,
>> >> >
>> >> > On 25/05/18 15:12, Vincent Guittot wrote:
>> >> >> Now that we have both the dl class bandwidth requirement and the dl class
>> >> >> utilization, we can use the max of the 2 values when agregating the
>> >> >> utilization of the CPU.
>> >> >>
>> >> >> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
>> >> >> ---
>> >> >> kernel/sched/sched.h | 6 +++++-
>> >> >> 1 file changed, 5 insertions(+), 1 deletion(-)
>> >> >>
>> >> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> >> >> index 4526ba6..0eb07a8 100644
>> >> >> --- a/kernel/sched/sched.h
>> >> >> +++ b/kernel/sched/sched.h
>> >> >> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
>> >> >> #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
>> >> >> static inline unsigned long cpu_util_dl(struct rq *rq)
>> >> >> {
>> >> >> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
>> >> >> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
>> >> >
>> >> > I'd be tempted to say the we actually want to cap to this one above
>> >> > instead of using the max (as you are proposing below) or the
>> >> > (theoretical) power reduction benefits of using DEADLINE for certain
>> >> > tasks might vanish.
>> >>
>> >> The problem that I'm facing is that the sched_entity bandwidth is
>> >> removed after the 0-lag time and the rq->dl.running_bw goes back to
>> >> zero but if the DL task has preempted a CFS task, the utilization of
>> >> the CFS task will be lower than reality and schedutil will set a lower
>> >> OPP whereas the CPU is always running.
>
> With UTIL_EST enabled I don't expect an OPP reduction below the
> expected utilization of a CFS task.

I'm not sure to fully catch what you mean but all tests that I ran,
are using util_est (which is enable by default if i'm not wrong). So
all OPP drops that have been observed, were with util_est

>
> IOW, when a periodic CFS task is preempted by a DL one, what we use
> for OPP selection once the DL task is over is still the estimated
> utilization for the CFS task itself. Thus, schedutil will eventually
> (since we have quite conservative down scaling thresholds) go down to
> the right OPP to serve that task.
>
>> >> The example with a RT task described in the cover letter can be
>> >> run with a DL task and will give similar results.
>
> In the cover letter you says:
>
> A rt-app use case which creates an always running cfs thread and a
> rt threads that wakes up periodically with both threads pinned on
> same CPU, show lot of frequency switches of the CPU whereas the CPU
> never goes idles during the test.
>
> I would say that's a quite specific corner case where your always
> running CFS task has never accumulated a util_est sample.
>
> Do we really have these cases in real systems?

My example is voluntary an extreme one because it's easier to
highlight the problem

>
> Otherwise, it seems to me that we are trying to solve quite specific
> corner cases by adding a not negligible level of "complexity".

By complexity, do you mean :

Taking into account the number cfs running task to choose between
rq->dl.running_bw and avg_dl.util_avg

I'm preparing a patchset that will provide the cfs waiting time in
addition to dl/rt util_avg for almost no additional cost. I will try
to sent the proposal later today

>
> Moreover, I also have the impression that we can fix these
> use-cases by:
>
> - improving the way we accumulate samples in util_est
> i.e. by discarding preemption time
>
> - maybe by improving the utilization aggregation in schedutil to
> better understand DL requirements
> i.e. a 10% utilization with a 100ms running time is way different
> then the same utilization with a 1ms running time
>
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi