Re: [PATCH v2] sched: Consolidate cpufreq updates

From: Vincent Guittot
Date: Tue May 07 2024 - 06:21:44 EST


On Tue, 7 May 2024 at 10:58, Vincent Guittot <vincent.guittot@xxxxxxxxxx> wrote:
>
> On Mon, 6 May 2024 at 01:31, Qais Yousef <qyousef@xxxxxxxxxxx> wrote:
> >
> > Improve the interaction with cpufreq governors by making the
> > cpufreq_update_util() calls more intentional.
> >
> > At the moment we send them when load is updated for CFS, bandwidth for
> > DL and at enqueue/dequeue for RT. But this can lead to too many
> > updates being sent in a short period of time, which can then be
> > ignored at a critical moment due to rate_limit_us in schedutil.
> >
> > For example, on a simultaneous task enqueue on the CPU where the 2nd
> > task is bigger and requires a higher freq, the trigger of
> > cpufreq_update_util() by the first task will lead to the 2nd request
> > being dropped until the tick, or until another CPU in the same policy
> > triggers a freq update shortly after.
> >
> > Updates at enqueue for RT are not strictly required, though they do
> > help to reduce the delay in switching the frequency and the potential
> > observation of a lower frequency during this delay. But the current
> > logic doesn't intentionally (at least to my understanding) try to
> > speed up the request.
> >
> > To help reduce the number of cpufreq updates and make them more
> > purposeful, consolidate them into these locations:
> >
> > 1. context_switch()
>
> I don't see any cpufreq update when switching from idle to CFS. We
> have to wait for the next tick to get a freq update whatever the value
> of util_est and uclamp

This seems to happen when the tick is not stopped
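
AFAICT that's because the fair path of update_cpufreq_ctx_switch() in this
patch only forces an update on iowait or when the root cfs_rq has decayed,
i.e. roughly (condensed from the hunk below):

	if (likely(fair_policy(current->policy))) {
		if (unlikely(current->in_iowait))
			goto force_update;
		/* Only a PELT decay of the root cfs_rq triggers an update */
		if (unlikely(rq->cfs.decayed)) {
			rq->cfs.decayed = false;
			goto force_update;
		}
		return;
	}

so a wakeup from idle that only changes util_est/uclamp has to wait for
the tick.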

>
> > 2. task_tick_fair()
>
> Updating only during the tick is ok with a tick at 1000Hz/1000us when we
> compare it with the 1024us period of PELT, but what about a 4ms or even
> 10ms tick? Utilization can increase by almost 200 in 10ms.
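
For reference, with the default 32ms PELT half-life and a task ramping up
from 0:

	util(10ms) ~= 1024 * (1 - 0.5^(10/32)) ~= 200

which is where the number above comes from.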
>
> > 3. {attach, detach}_entity_load_avg()
>
> At enqueue/dequeue, util_est will be updated and can make the cpu
> utilization quite different, especially with long-sleeping tasks. The
> same applies for the uclamp_min/max hints of a newly enqueued task. We
> might end up waiting 4/10ms depending on the tick period.
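
As a made-up example: a periodic task with util_est ~400 sleeps for a few
hundred ms and wakes on a now-idle CPU. cpu_util_cfs() jumps from ~0 to
~400 at enqueue because of util_est, but with this patch the first
opportunity to raise the frequency is the next tick, i.e. up to 4/10ms at
the wrong OPP on every activation.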
>
> > 4. update_blocked_averages()
> >
> > The update at context switch should help guarantee that DL and RT get
> > the right frequency straightaway when they're RUNNING. As mentioned,
> > though, the update will happen slightly after enqueue_task(); but in
> > an ideal world these tasks should be RUNNING ASAP and this additional
> > delay should be negligible. For fair tasks we need to make sure we send
> > a single update for every decay of the root cfs_rq. Any changes to the
> > rq will be deferred until the next task is ready to run, or until we
> > hit the tick. But we are guaranteed the task is running at a level that
> > meets its requirements after enqueue.
> >
> > To guarantee that RT and DL task updates are never missed, we add a new
> > SCHED_CPUFREQ_FORCE_UPDATE flag to ignore rate_limit_us. If we are
> > already running at the right freq, the governor will end up doing
> > nothing, but we eliminate the risk of the task ending up accidentally
> > running at the wrong freq due to rate_limit_us.
> >
> > Similarly for iowait boost, we ignore rate limits. We also handle the
> > case of the boost being reset prematurely by adding a guard in
> > sugov_iowait_apply() that only reduces the boost after 1ms has passed;
> > it seems the iowait boost mechanism relied on rate_limit_us and
> > cfs_rq decay to prevent any updates from happening soon after an
> > iowait boost.
> >
> > The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit
> > time stamps otherwise we can end up delaying updates for normal
> > requests.
> >
> > As a simple optimization, we avoid sending cpufreq updates when
> > switching from one RT task to another, as RT tasks run at max freq by
> > default. If CONFIG_UCLAMP_TASK is enabled, we additionally check
> > whether uclamp_min differs; most RT tasks are likely to be running at
> > the same performance level, so we can avoid the unnecessary overhead
> > of forced updates when there's nothing to do.
> >
> > We also make sure to ignore cpufreq updates for sugov workers at
> > context switch. It doesn't make sense for the kworker that applies the
> > frequency update (which is a DL task) to trigger a frequency update
> > itself.
> >
> > The update at task_tick_fair() will guarantee that the governor will
> > follow any updates to task/CPU load or due to new enqueues/dequeues
> > to the rq. Since DL and RT always run at constant frequencies and have
> > no load tracking, this is only required for fair tasks.
> >
> > The update at attach/detach_entity_load_avg() will ensure we adapt to
> > big changes when tasks are added/removed from cgroups.
> >
> > The update at update_blocked_averages() will ensure we decay frequency
> > as the CPU becomes idle for long enough.
> >
> > Results of
> >
> > taskset 1 perf stat --repeat 10 -e cycles,instructions,task-clock perf bench sched pipe
> >
> > on an AMD 3900X, to verify any potential overhead of the addition at
> > context switch, against the v6.8.7 stable kernel:
> >
> > v6.8.7: schedutil:
> > ------------------
> >
> > Performance counter stats for 'perf bench sched pipe' (10 runs):
> >
> > 850,276,689 cycles:u # 0.078 GHz ( +- 0.88% )
> > 82,724,245 instructions:u # 0.10 insn per cycle ( +- 0.00% )
> > 10,881.41 msec task-clock:u # 0.995 CPUs utilized ( +- 0.12% )
> >
> > 10.9377 +- 0.0135 seconds time elapsed ( +- 0.12% )
> >
> > v6.8.7: performance:
> > --------------------
> >
> > Performance counter stats for 'perf bench sched pipe' (10 runs):
> >
> > 874,154,415 cycles:u # 0.080 GHz ( +- 0.78% )
> > 82,724,420 instructions:u # 0.10 insn per cycle ( +- 0.00% )
> > 10,916.47 msec task-clock:u # 0.999 CPUs utilized ( +- 0.09% )
> >
> > 10.9308 +- 0.0100 seconds time elapsed ( +- 0.09% )
> >
> > v6.8.7+patch: schedutil:
> > ------------------------
> >
> > Performance counter stats for 'perf bench sched pipe' (10 runs):
> >
> > 816,938,281 cycles:u # 0.075 GHz ( +- 0.84% )
> > 82,724,163 instructions:u # 0.10 insn per cycle ( +- 0.00% )
> > 10,907.62 msec task-clock:u # 1.004 CPUs utilized ( +- 0.11% )
> >
> > 10.8627 +- 0.0121 seconds time elapsed ( +- 0.11% )
> >
> > v6.8.7+patch: performance:
> > --------------------------
> >
> > Performance counter stats for 'perf bench sched pipe' (10 runs):
> >
> > 814,038,416 cycles:u # 0.074 GHz ( +- 1.21% )
> > 82,724,356 instructions:u # 0.10 insn per cycle ( +- 0.00% )
> > 10,886.69 msec task-clock:u # 0.996 CPUs utilized ( +- 0.17% )
> >
> > 10.9298 +- 0.0181 seconds time elapsed ( +- 0.17% )
> >
> > It is worth noting that we still have the following race condition on
> > systems that have a shared policy:
> >
> > * CPUs with a shared policy can end up sending simultaneous cpufreq
> > update requests, where the 2nd one will be unlucky and get blocked by
> > rate_limit_us (schedutil).
> >
> > We can potentially address this limitation later, but it is out of the
> > scope of this patch.
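
Side note: IIUC the race you describe is something like:

	CPU0: cpufreq_update_util() -> freq updated,
	      sg_policy->last_freq_update_time = now
	CPU1 (same policy): cpufreq_update_util() shortly after ->
	      dropped by sugov_should_update_freq() because of
	      rate_limit_us

so the 2nd CPU's request is only honoured once rate_limit_us expires.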
> >
> > Signed-off-by: Qais Yousef <qyousef@xxxxxxxxxxx>
> > ---
> >
> > Changes since v1:
> >
> > * Use taskset and measure with performance governor as Ingo suggested
> > * Remove the static key as I found out we always register a function
> > for cpu_dbs in cpufreq_governor.c; and as Christian pointed out it
> > triggers a lock debug warning.
> > * Improve detection of sugov workers by using SCHED_FLAG_SUGOV
> > * Guard against NSEC_PER_MSEC instead of TICK_USEC to avoid prematurely
> > reducing iowait boost, as the latter was a NOP behaving like
> > sugov_iowait_reset(), as Christian pointed out.
> >
> > v1 discussion: https://lore.kernel.org/all/20240324020139.1032473-1-qyousef@xxxxxxxxxxx/
> >
> > include/linux/sched/cpufreq.h | 3 +-
> > kernel/sched/core.c | 68 +++++++++++++++++++++++++++++++-
> > kernel/sched/cpufreq_schedutil.c | 55 +++++++++++++++++++-------
> > kernel/sched/deadline.c | 4 --
> > kernel/sched/fair.c | 53 ++++---------------------
> > kernel/sched/rt.c | 8 +---
> > kernel/sched/sched.h | 5 +++
> > 7 files changed, 122 insertions(+), 74 deletions(-)
> >
> > diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> > index bdd31ab93bc5..2d0a45aba16f 100644
> > --- a/include/linux/sched/cpufreq.h
> > +++ b/include/linux/sched/cpufreq.h
> > @@ -8,7 +8,8 @@
> > * Interface between cpufreq drivers and the scheduler:
> > */
> >
> > -#define SCHED_CPUFREQ_IOWAIT (1U << 0)
> > +#define SCHED_CPUFREQ_IOWAIT (1U << 0)
> > +#define SCHED_CPUFREQ_FORCE_UPDATE (1U << 1) /* ignore transition_delay_us */
> >
> > #ifdef CONFIG_CPU_FREQ
> > struct cpufreq_policy;
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 1a914388144a..e6fe7dbd1f89 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -5134,6 +5134,65 @@ static inline void balance_callbacks(struct rq *rq, struct balance_callback *hea
> >
> > #endif
> >
> > +static inline void update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev)
> > +{
> > +#ifdef CONFIG_CPU_FREQ
> > + unsigned int flags = 0;
> > +
> > +#ifdef CONFIG_SMP
> > + if (unlikely(current->sched_class == &stop_sched_class))
> > + return;
> > +#endif
> > +
> > + if (unlikely(current->sched_class == &idle_sched_class))
> > + return;
> > +
> > + if (unlikely(task_has_idle_policy(current)))
> > + return;
> > +
> > + if (likely(fair_policy(current->policy))) {
> > +
> > + if (unlikely(current->in_iowait)) {
> > + flags |= SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE;
> > + goto force_update;
> > + }
> > +
> > +#ifdef CONFIG_SMP
> > + /*
> > + * Allow cpufreq updates once for every update_load_avg() decay.
> > + */
> > + if (unlikely(rq->cfs.decayed)) {
> > + rq->cfs.decayed = false;
> > + goto force_update;
> > + }
> > +#endif
> > + return;
> > + }
> > +
> > + /*
> > + * RT and DL should always send a freq update. But we can do some
> > + * simple checks to avoid it when we know it's not necessary.
> > + */
> > + if (rt_task(current) && rt_task(prev)) {
> > +#ifdef CONFIG_UCLAMP_TASK
> > + unsigned long curr_uclamp_min = uclamp_eff_value(current, UCLAMP_MIN);
> > + unsigned long prev_uclamp_min = uclamp_eff_value(prev, UCLAMP_MIN);
> > +
> > + if (curr_uclamp_min == prev_uclamp_min)
> > +#endif
> > + return;
> > + } else if (dl_task(current) && current->dl.flags & SCHED_FLAG_SUGOV) {
> > + /* Ignore sugov kthreads, they're responding to our requests */
> > + return;
> > + }
> > +
> > + flags |= SCHED_CPUFREQ_FORCE_UPDATE;
> > +
> > +force_update:
> > + cpufreq_update_util(rq, flags);
> > +#endif
> > +}
> > +
> > static inline void
> > prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
> > {
> > @@ -5151,7 +5210,7 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf
> > #endif
> > }
> >
> > -static inline void finish_lock_switch(struct rq *rq)
> > +static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
> > {
> > /*
> > * If we are tracking spinlock dependencies then we have to
> > @@ -5160,6 +5219,11 @@ static inline void finish_lock_switch(struct rq *rq)
> > */
> > spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
> > __balance_callbacks(rq);
> > + /*
> > + * Request freq update after __balance_callbacks to take into account
> > + * any changes to rq.
> > + */
> > + update_cpufreq_ctx_switch(rq, prev);
> > raw_spin_rq_unlock_irq(rq);
> > }
> >
> > @@ -5278,7 +5342,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
> > perf_event_task_sched_in(prev, current);
> > finish_task(prev);
> > tick_nohz_task_switch();
> > - finish_lock_switch(rq);
> > + finish_lock_switch(rq, prev);
> > finish_arch_post_lock_switch();
> > kcov_finish_switch(current);
> > /*
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index eece6244f9d2..e8b65b75e7f3 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -59,7 +59,8 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> >
> > /************************ Governor internals ***********************/
> >
> > -static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> > +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time,
> > + unsigned int flags)
> > {
> > s64 delta_ns;
> >
> > @@ -87,13 +88,16 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> > return true;
> > }
> >
> > + if (unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
> > + return true;
> > +
> > delta_ns = time - sg_policy->last_freq_update_time;
> >
> > return delta_ns >= sg_policy->freq_update_delay_ns;
> > }
> >
> > static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
> > - unsigned int next_freq)
> > + unsigned int next_freq, unsigned int flags)
> > {
> > if (sg_policy->need_freq_update)
> > sg_policy->need_freq_update = cpufreq_driver_test_flags(CPUFREQ_NEED_UPDATE_LIMITS);
> > @@ -101,7 +105,9 @@ static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
> > return false;
> >
> > sg_policy->next_freq = next_freq;
> > - sg_policy->last_freq_update_time = time;
> > +
> > + if (!unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
> > + sg_policy->last_freq_update_time = time;
> >
> > return true;
> > }
> > @@ -249,9 +255,10 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
> > unsigned int flags)
> > {
> > bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
> > + bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE;
> >
> > /* Reset boost if the CPU appears to have been idle enough */
> > - if (sg_cpu->iowait_boost &&
> > + if (sg_cpu->iowait_boost && !forced_update &&
> > sugov_iowait_reset(sg_cpu, time, set_iowait_boost))
> > return;
> >
> > @@ -294,17 +301,34 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
> > * being more conservative on tasks which does sporadic IO operations.
> > */
> > static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
> > - unsigned long max_cap)
> > + unsigned long max_cap, unsigned int flags)
> > {
> > + bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE;
> > + s64 delta_ns = time - sg_cpu->last_update;
> > +
> > /* No boost currently required */
> > if (!sg_cpu->iowait_boost)
> > return 0;
> >
> > + if (forced_update)
> > + goto apply_boost;
> > +
> > /* Reset boost if the CPU appears to have been idle enough */
> > if (sugov_iowait_reset(sg_cpu, time, false))
> > return 0;
> >
> > if (!sg_cpu->iowait_boost_pending) {
> > + /*
> > + * This logic relied on PELT signal decays happening once every
> > + * 1ms. But due to changes to how updates are done now, we can
> > + * end up with more requests coming in, leading to the iowait
> > + * boost being prematurely reduced. Make the assumption explicit
> > + * until we improve the iowait boost logic in general, as it is
> > + * due for an overhaul.
> > + */
> > + if (delta_ns <= NSEC_PER_MSEC)
> > + goto apply_boost;
> > +
> > /*
> > * No boost pending; reduce the boost value.
> > */
> > @@ -315,6 +339,7 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
> > }
> > }
> >
> > +apply_boost:
> > sg_cpu->iowait_boost_pending = false;
> >
> > /*
> > @@ -358,10 +383,10 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
> >
> > ignore_dl_rate_limit(sg_cpu);
> >
> > - if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
> > + if (!sugov_should_update_freq(sg_cpu->sg_policy, time, flags))
> > return false;
> >
> > - boost = sugov_iowait_apply(sg_cpu, time, max_cap);
> > + boost = sugov_iowait_apply(sg_cpu, time, max_cap, flags);
> > sugov_get_util(sg_cpu, boost);
> >
> > return true;
> > @@ -397,7 +422,7 @@ static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
> > sg_policy->cached_raw_freq = cached_freq;
> > }
> >
> > - if (!sugov_update_next_freq(sg_policy, time, next_f))
> > + if (!sugov_update_next_freq(sg_policy, time, next_f, flags))
> > return;
> >
> > /*
> > @@ -449,10 +474,12 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
> > cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
> > sg_cpu->util, max_cap);
> >
> > - sg_cpu->sg_policy->last_freq_update_time = time;
> > + if (!unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
> > + sg_cpu->sg_policy->last_freq_update_time = time;
> > }
> >
> > -static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
> > +static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time,
> > + unsigned int flags)
> > {
> > struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> > struct cpufreq_policy *policy = sg_policy->policy;
> > @@ -465,7 +492,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
> > struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
> > unsigned long boost;
> >
> > - boost = sugov_iowait_apply(j_sg_cpu, time, max_cap);
> > + boost = sugov_iowait_apply(j_sg_cpu, time, max_cap, flags);
> > sugov_get_util(j_sg_cpu, boost);
> >
> > util = max(j_sg_cpu->util, util);
> > @@ -488,10 +515,10 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
> >
> > ignore_dl_rate_limit(sg_cpu);
> >
> > - if (sugov_should_update_freq(sg_policy, time)) {
> > - next_f = sugov_next_freq_shared(sg_cpu, time);
> > + if (sugov_should_update_freq(sg_policy, time, flags)) {
> > + next_f = sugov_next_freq_shared(sg_cpu, time, flags);
> >
> > - if (!sugov_update_next_freq(sg_policy, time, next_f))
> > + if (!sugov_update_next_freq(sg_policy, time, next_f, flags))
> > goto unlock;
> >
> > if (sg_policy->policy->fast_switch_enabled)
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index a04a436af8cc..02c9c2488091 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -252,8 +252,6 @@ void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
> > dl_rq->running_bw += dl_bw;
> > SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
> > SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
> > - /* kick cpufreq (see the comment in kernel/sched/sched.h). */
> > - cpufreq_update_util(rq_of_dl_rq(dl_rq), 0);
> > }
> >
> > static inline
> > @@ -266,8 +264,6 @@ void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
> > SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
> > if (dl_rq->running_bw > old)
> > dl_rq->running_bw = 0;
> > - /* kick cpufreq (see the comment in kernel/sched/sched.h). */
> > - cpufreq_update_util(rq_of_dl_rq(dl_rq), 0);
> > }
> >
> > static inline
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 9eb63573110c..cbe79c8ac2ed 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3982,29 +3982,6 @@ static inline void update_cfs_group(struct sched_entity *se)
> > }
> > #endif /* CONFIG_FAIR_GROUP_SCHED */
> >
> > -static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
> > -{
> > - struct rq *rq = rq_of(cfs_rq);
> > -
> > - if (&rq->cfs == cfs_rq) {
> > - /*
> > - * There are a few boundary cases this might miss but it should
> > - * get called often enough that that should (hopefully) not be
> > - * a real problem.
> > - *
> > - * It will not get called when we go idle, because the idle
> > - * thread is a different class (!fair), nor will the utilization
> > - * number include things like RT tasks.
> > - *
> > - * As is, the util number is not freq-invariant (we'd have to
> > - * implement arch_scale_freq_capacity() for that).
> > - *
> > - * See cpu_util_cfs().
> > - */
> > - cpufreq_update_util(rq, flags);
> > - }
> > -}
> > -
> > #ifdef CONFIG_SMP
> > static inline bool load_avg_is_decayed(struct sched_avg *sa)
> > {
> > @@ -4682,7 +4659,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >
> > add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
> >
> > - cfs_rq_util_change(cfs_rq, 0);
> > + cpufreq_update_util(rq_of(cfs_rq), 0);
> >
> > trace_pelt_cfs_tp(cfs_rq);
> > }
> > @@ -4712,7 +4689,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >
> > add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
> >
> > - cfs_rq_util_change(cfs_rq, 0);
> > + cpufreq_update_util(rq_of(cfs_rq), 0);
> >
> > trace_pelt_cfs_tp(cfs_rq);
> > }
> > @@ -4729,7 +4706,6 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > {
> > u64 now = cfs_rq_clock_pelt(cfs_rq);
> > - int decayed;
> >
> > /*
> > * Track task load average for carrying it to new CPU after migrated, and
> > @@ -4738,8 +4714,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
> > __update_load_avg_se(now, cfs_rq, se);
> >
> > - decayed = update_cfs_rq_load_avg(now, cfs_rq);
> > - decayed |= propagate_entity_load_avg(se);
> > + cfs_rq->decayed = update_cfs_rq_load_avg(now, cfs_rq);
> > + cfs_rq->decayed |= propagate_entity_load_avg(se);
> >
> > if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> >
> > @@ -4760,11 +4736,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > */
> > detach_entity_load_avg(cfs_rq, se);
> > update_tg_load_avg(cfs_rq);
> > - } else if (decayed) {
> > - cfs_rq_util_change(cfs_rq, 0);
> > -
> > - if (flags & UPDATE_TG)
> > - update_tg_load_avg(cfs_rq);
> > + } else if (cfs_rq->decayed && (flags & UPDATE_TG)) {
> > + update_tg_load_avg(cfs_rq);
> > }
> > }
> >
> > @@ -5139,7 +5112,6 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> >
> > static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
> > {
> > - cfs_rq_util_change(cfs_rq, 0);
> > }
> >
> > static inline void remove_entity_load_avg(struct sched_entity *se) {}
> > @@ -6754,14 +6726,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > */
> > util_est_enqueue(&rq->cfs, p);
> >
> > - /*
> > - * If in_iowait is set, the code below may not trigger any cpufreq
> > - * utilization updates, so do it here explicitly with the IOWAIT flag
> > - * passed.
> > - */
> > - if (p->in_iowait)
> > - cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
> > -
> > for_each_sched_entity(se) {
> > if (se->on_rq)
> > break;
> > @@ -9351,10 +9315,6 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
> > unsigned long hw_pressure;
> > bool decayed;
> >
> > - /*
> > - * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
> > - * DL and IRQ signals have been updated before updating CFS.
> > - */
> > curr_class = rq->curr->sched_class;
> >
> > hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
> > @@ -12685,6 +12645,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> >
> > update_misfit_status(curr, rq);
> > check_update_overutilized_status(task_rq(curr));
> > + cpufreq_update_util(rq, 0);
> >
> > task_tick_core(rq, curr);
> > }
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index 3261b067b67e..fe6d8b0ffa95 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -556,11 +556,8 @@ static void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
> >
> > rt_se = rt_rq->tg->rt_se[cpu];
> >
> > - if (!rt_se) {
> > + if (!rt_se)
> > dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running);
> > - /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
> > - cpufreq_update_util(rq_of_rt_rq(rt_rq), 0);
> > - }
> > else if (on_rt_rq(rt_se))
> > dequeue_rt_entity(rt_se, 0);
> > }
> > @@ -1065,9 +1062,6 @@ enqueue_top_rt_rq(struct rt_rq *rt_rq)
> > add_nr_running(rq, rt_rq->rt_nr_running);
> > rt_rq->rt_queued = 1;
> > }
> > -
> > - /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
> > - cpufreq_update_util(rq, 0);
> > }
> >
> > #if defined CONFIG_SMP
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index cb3792c04eea..86cec2145221 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -632,6 +632,11 @@ struct cfs_rq {
> > unsigned long runnable_avg;
> > } removed;
> >
> > + /*
> > + * Store whether the last update_load_avg() resulted in a decay
> > + */
> > + bool decayed;
> > +
> > #ifdef CONFIG_FAIR_GROUP_SCHED
> > u64 last_update_tg_load_avg;
> > unsigned long tg_load_avg_contrib;
> > --
> > 2.34.1
> >