Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

From: Vincent Guittot
Date: Fri Jan 29 2021 - 12:31:35 EST


On Friday, 29 Jan 2021 at 11:33:00 (+0100), Vincent Guittot wrote:
> On Thu, 28 Jan 2021 at 16:09, Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> >
> > Hi Vincent,
> >
> > On Thu, Jan 28, 2021 at 8:57 AM Vincent Guittot
> > <vincent.guittot@xxxxxxxxxx> wrote:
> > > > On Mon, Jan 25, 2021 at 03:42:41PM +0100, Vincent Guittot wrote:
> > > > > On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> > > > > > On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > > > > > > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > > > > > > <joel@xxxxxxxxxxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > > > > > > > that there are a lot of calls to update_blocked_averages(). This causes
> > > > > > > > the schedule loop to slow down, taking up to 500 microseconds at
> > > > > > > > times (due to newidle load balance). I have also seen this manifest in
> > > > > > > > the periodic balancer.
> > > > > > > >
> > > > > > > > A closer look shows that the problem is caused by the following
> > > > > > > > ingredients:
> > > > > > > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > > > > > > suggesting to inspect /proc/sched_debug for this), this can make
> > > > > > > > __update_blocked_fair() take a long time.
> > > > > > >
> > > > > > > Inactive cgroups are removed from the list so they should not impact
> > > > > > > the duration
> > > > > >
> > > > > > I meant blocked CGroups. According to this code, a cfs_rq can be partially
> > > > > > decayed and not have any tasks running on it but its load needs to be
> > > > > > decayed, correct? That's what I meant by 'inactive'. I can reword it to
> > > > > > 'blocked'.
> > > > >
> > > > > How many blocked cgroups have you got?
> > > >
> > > > I put a counter in for_each_leaf_cfs_rq_safe() { } to count how many times
> > > > this loop runs per new idle balance. When the problem happens I see this loop
> > > > run 35-40 times (for a single instance of newidle balance). So in total
> > > > there are at least that many cfs_rq load updates.
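> > > >
> > > > Roughly, the instrumentation looked like this (a minimal sketch; the
> > > > counter and the trace_printk() are mine, wrapped around the loop in
> > > > __update_blocked_fair()):
> > > >
> > > >         unsigned int nr_updates = 0;
> > > >
> > > >         for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
> > > >                 nr_updates++;   /* one blocked load update per leaf cfs_rq */
> > > >                 ...
> > > >         }
> > > >         trace_printk("newidle: updated %u leaf cfs_rq\n", nr_updates);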
> > >
> > > Do you mean that you have 35-40 cgroups? Or does the 35-40 include all CPUs?
> >
> > All CPUs.
> >
> > > > I also see that new idle balance can be called 200-500 times per second.
> > >
> > > This is not surprising because newidle_balance() is called every time
> > > the CPU is about to become idle.
> >
> > Sure.
> >
> > > > > >
> > > > > > * There can be a lot of idle CPU cgroups. Don't let fully
> > > > > > * decayed cfs_rqs linger on the list.
> > > > > > */
> > > > > > if (cfs_rq_is_decayed(cfs_rq))
> > > > > > list_del_leaf_cfs_rq(cfs_rq);
> > > > > >
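> > > > > > For reference, cfs_rq_is_decayed() only returns true once all the
> > > > > > PELT sums have decayed to zero, which is why a blocked cfs_rq can
> > > > > > stay on the list for a while. Roughly (from memory of v5.4; the
> > > > > > exact set of fields differs between versions):
> > > > > >
> > > > > > static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> > > > > > {
> > > > > >         if (cfs_rq->load.weight)
> > > > > >                 return false;
> > > > > >         if (cfs_rq->avg.load_sum)
> > > > > >                 return false;
> > > > > >         if (cfs_rq->avg.util_sum)
> > > > > >                 return false;
> > > > > >         /* ... plus the runnable sum, depending on the version */
> > > > > >         return true;
> > > > > > }
> > > > > >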
> > > > > > > > 2. The device has a lot of CPUs in a cluster, which causes schedutil in a
> > > > > > > > shared frequency domain configuration to be slower than usual. (the load
> > > > > > >
> > > > > > > What do you mean exactly by "it causes schedutil to be slower than usual"?
> > > > > >
> > > > > > sugov_next_freq_shared() is O(number of CPUs in a cluster). This
> > > > > > system is a 6+2 system with 6 CPUs in a cluster. A schedutil shared-policy
> > > > > > frequency update needs to go through the utilization of the other CPUs in the
> > > > > > cluster. I believe this could be adding to the problem, but it does not really
> > > > > > need optimizing if we can rate limit the calls to update_blocked_averages()
> > > > > > to begin with.
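> > > > > >
> > > > > > Roughly, the shared path looks like this (simplified from
> > > > > > kernel/sched/cpufreq_schedutil.c; details vary by kernel version):
> > > > > >
> > > > > > static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
> > > > > > {
> > > > > >         struct cpufreq_policy *policy = sg_cpu->sg_policy->policy;
> > > > > >         unsigned long util = 0, max = 1;
> > > > > >         unsigned int j;
> > > > > >
> > > > > >         /* O(nr of CPUs in the policy): aggregate the whole cluster */
> > > > > >         for_each_cpu(j, policy->cpus) {
> > > > > >                 struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
> > > > > >                 ...
> > > > > >         }
> > > > > >
> > > > > >         return get_next_freq(sg_cpu->sg_policy, util, max);
> > > > > > }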
> > > > >
> > > > > Qais mentioned half of the time being used by
> > > > > sugov_next_freq_shared(). Are there any frequency changes resulting
> > > > > from this call?
> > > >
> > > > I do not see a frequency update happening at the time of the problem. However,
> > > > note that sugov_iowait_boost() does run even if the frequency is not being
> > > > updated. IIRC, this function is also not that lightweight and I am not sure
> > > > if it is a good idea to call it that often.
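> > > >
> > > > From memory, the ordering in sugov_update_shared() is roughly:
> > > >
> > > >         sugov_iowait_boost(sg_cpu, time, flags);  /* runs on every call */
> > > >         sg_cpu->last_update = time;
> > > >
> > > >         if (sugov_should_update_freq(sg_policy, time)) {  /* rate limited */
> > > >                 /* compute and apply the next frequency */
> > > >         }
> > > >
> > > > so the io-boost bookkeeping happens before the rate limit check.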
> > >
> > > The scheduler can't make any assumptions about how often schedutil/cpufreq
> > > wants to be called. Some governors are fast and straightforward and can be
> > > called very often to adjust frequency; others can't handle many
> > > updates. The rate limit mechanism in schedutil and io-boost is
> > > there for that purpose.
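> > >
> > > As a reminder, the rate limit boils down to something like this
> > > (simplified from sugov_should_update_freq(), ignoring the
> > > work_in_progress and remote-update checks):
> > >
> > >         s64 delta_ns = time - sg_policy->last_freq_update_time;
> > >
> > >         return delta_ns >= sg_policy->freq_update_delay_ns;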
> >
> > Sure, I know that's the intention.
> >
> > > > > > > > average updates also try to update the frequency in schedutil).
> > > > > > > >
> > > > > > > > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > > > > > > > code paths to take longer than when running at a high CPU frequency.
> > > > > > >
> > > > > > > Low frequency usually means low utilization, so it shouldn't happen that much.
> > > > > >
> > > > > > It happens a lot as can be seen with schbench. It is super easy to reproduce.
> > > > >
> > > > > Happening a lot is not in itself a problem if there is nothing else to
> > > > > do, so it's not an argument by itself.
> > > >
> > > > It is a problem: it shows up in the preempt-off critical section latency
> > >
> > > But this point is not related to the point above which is about how
> > > often it happens.
> >
> > It is related. A bad thing happening quite often is worse than a bad
> > thing happening infrequently. I agree it is not a root cause fix, but
> > it makes things "better".
> >
> > > > tracer. Are you saying it's OK for preemption to be disabled on the system
> > > > for 500 microseconds? That hurts real-time applications (audio etc.).
> > >
> > > So, is your problem related to real-time applications (audio etc.)?
> >
> > Yes it is. I don't have a reproducer, and it could be that audio
> > buffering will hide the problem. But that doesn't mean that there is
> > no problem or that we cannot improve things. There will also likely be
> > a power consumption improvement.
> >
> > > > > So why is it a problem for you? You are mentioning newly idle load
> > > > > balance so I assume that your root problem is the scheduling delay
> > > > > generated by the newly idle load balance which then calls
> > > > > update_blocked_averages
> > > >
> > > > Yes, the new idle balance is when I see it happen quite often. I do see it
> > > > happen with other load balances as well, but not as often, since those LB
> > > > don't run as often as new idle balance.
> > >
> > > The update of average blocked load has been added in newidle_balance
> > > to take advantage of the cpu becoming idle, but it seems to create a
> > > long preempt-off sequence. I'm going to prepare a patch to move it
> > > out of the schedule path.
> >
> > Ok thanks, that would really help!
> >
> > > > > rate limiting the call to update_blocked_averages() will only reduce
> > > > > the number of times it happens but it will not prevent it from happening.
> > > >
> > > > Sure, but a soft real-time workload can tolerate the issue if it does not
> > > > happen very often. In this case, though, it is frequent.
> > >
> > > Could you provide details of the problem that you are facing? It's
> > > not clear to me what happens in your case in the end. Have you got an
> > > audio glitch, for example?
> > >
> > > "Not often" doesn't really give any clue.
> >
> > I believe from my side I have provided details. I shared output from a
> > function graph tracer and the schbench microbenchmark clearly showing the
> > issue and improvements. Sorry, I don't have a real-world reproducer
>
> In fact I disagree, and I'm not sure that your results show the real
> problem rather than just a side effect related to your system.
>
> With schbench -t 2 -m 2 -r 5, the test runs 1 task per CPU and newidle
> balance should never be triggered because tasks will get an idle cpu
> every time. When I run schbench -t 3 -m 2 -r 5 (in order to get 8
> threads on my 8-cpu system), all threads directly wake up on an idle
> cpu and newidle balance is never involved. As a result, the schbench
> results are not impacted by newidle_balance whatever its duration.
> This means that a problem in newidle_balance doesn't impact schbench
> results with a correct task placement. This also means that in your
> case, some threads are placed on the same cpu and wait to be scheduled
> and finally a lot of things can generate the increased delay... If
> you look at your results for schbench -t 2 -m 2 -r 5: the *99.0th:
> 12656 (8 samples) shows a delay of around 12ms, which is the typical
> running time slice of a task when several tasks are fighting for the
> same cpu and one has to wait. So this result does not reflect the
> duration of newidle balance but instead that the task placement was
> wrong and one task had to wait before running. Your RFC patch probably
> changes the dynamics, and as a result the task placement, but it does not
> save 12ms and is irrelevant regarding the problem that you raised
> about the duration of update_blocked_load.
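>
> (For reference, assuming default tunables on an 8-CPU system:
> sched_latency = 6ms * (1 + ilog2(8)) = 24ms, so each of 2 tasks
> fighting for the same cpu runs for about 24ms / 2 = 12ms before the
> other one gets to run, which matches the 12656us above.)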
>
> If I run schbench -t 3 -m 2 -r 5 on a dragonboard 845 (4 little, 4
> big) with schedutil and EAS enabled:
> /home/linaro/Bin/schbench -t 3 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (318 total samples)
> 50.0th: 315 (159 samples)
> 75.0th: 735 (80 samples)
> 90.0th: 773 (48 samples)
> 95.0th: 845 (16 samples)
> *99.0th: 12336 (12 samples)
> 99.5th: 15408 (2 samples)
> 99.9th: 17504 (1 samples)
> min=4, max=17473
>
> I have similar results, and a quick look at the trace shows that 2
> tasks are fighting for the same cpu whereas there are idle cpus. Then,
> if I use a cpufreq governor other than schedutil, like ondemand as an
> example, EAS is disabled and the results become:
> /home/linaro/Bin/schbench -t 3 -m 2 -r 5
> Latency percentiles (usec) runtime 5 (s) (318 total samples)
> 50.0th: 232 (159 samples)
> 75.0th: 268 (80 samples)
> 90.0th: 292 (49 samples)
> 95.0th: 307 (15 samples)
> *99.0th: 394 (12 samples)
> 99.5th: 397 (2 samples)
> 99.9th: 400 (1 samples)
> min=114, max=400
>
> So a quick and wrong conclusion could be to say that you should disable EAS ;-)
>
> All that to say that your schbench results are not relevant, and I
> disagree when you say that you have provided all the details of your
> problem.
>
> Then, your function graph shows a half-millisecond duration but there
> are no details about the context. It seems that your trace happens
> right after sugov changed the frequency. But it would not be surprising
> to have such a large duration when running at the lowest
> capacity/frequency of the system, as an example.
>
> All that to say that the main reason for my questions is just to make
> sure that the problem is fully understood and the results are
> relevant, which doesn't seem to be the case for schbench at least.
>
> The only point that I agree with is that running
> update_blocked_averages with preempt and irq off is not a good thing,
> because we don't control the number of cfs_rq to update, and I'm going
> to provide a patchset for this.

The patch below moves the update of the blocked load of CPUs outside newidle_balance().

Instead, the update is done with the usual idle load balance update. I'm working on an
additional patch that will select the CPU that is about to become idle, instead of a
random idle CPU, but this 1st step fixes the problem of a lot of updates in newly idle.
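
(With this change, nohz_newidle_balance() simply kicks an idle CPU with
kick_ilb(NOHZ_STATS_KICK), so the blocked load update runs from the idle
load balance path on that CPU instead of being done locally in the newidle
path with preemption disabled.)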

Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
---
 kernel/sched/fair.c | 32 +++-----------------------------
 1 file changed, 3 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 197a51473e0c..8200b1d4df3d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7421,8 +7421,6 @@ enum migration_type {
 #define LBF_NEED_BREAK  0x02
 #define LBF_DST_PINNED  0x04
 #define LBF_SOME_PINNED 0x08
-#define LBF_NOHZ_STATS  0x10
-#define LBF_NOHZ_AGAIN  0x20

 struct lb_env {
         struct sched_domain     *sd;
@@ -8426,9 +8424,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
         for_each_cpu_and(i, sched_group_span(group), env->cpus) {
                 struct rq *rq = cpu_rq(i);

-                if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false))
-                        env->flags |= LBF_NOHZ_AGAIN;
-
                 sgs->group_load += cpu_load(rq);
                 sgs->group_util += cpu_util(i);
                 sgs->group_runnable += cpu_runnable(rq);
@@ -8969,11 +8964,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
         struct sg_lb_stats tmp_sgs;
         int sg_status = 0;

-#ifdef CONFIG_NO_HZ_COMMON
-        if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
-                env->flags |= LBF_NOHZ_STATS;
-#endif
-
         do {
                 struct sg_lb_stats *sgs = &tmp_sgs;
                 int local_group;
@@ -9010,15 +9000,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
         /* Tag domain that child domain prefers tasks go to siblings first */
         sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING;

-#ifdef CONFIG_NO_HZ_COMMON
-        if ((env->flags & LBF_NOHZ_AGAIN) &&
-            cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) {
-
-                WRITE_ONCE(nohz.next_blocked,
-                           jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD));
-        }
-#endif
-
         if (env->sd->flags & SD_NUMA)
                 env->fbq_type = fbq_classify_group(&sds->busiest_stat);

@@ -10547,14 +10528,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
                 return;

         raw_spin_unlock(&this_rq->lock);
-        /*
-         * This CPU is going to be idle and blocked load of idle CPUs
-         * need to be updated. Run the ilb locally as it is a good
-         * candidate for ilb instead of waking up another idle CPU.
-         * Kick an normal ilb if we failed to do the update.
-         */
-        if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
-                kick_ilb(NOHZ_STATS_KICK);
+        kick_ilb(NOHZ_STATS_KICK);
         raw_spin_lock(&this_rq->lock);
 }

@@ -10616,8 +10590,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
                         update_next_balance(sd, &next_balance);
                 rcu_read_unlock();

-                nohz_newidle_balance(this_rq);
-
                 goto out;
         }

@@ -10683,6 +10655,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)

         if (pulled_task)
                 this_rq->idle_stamp = 0;
+        else
+                nohz_newidle_balance(this_rq);

         rq_repin_lock(this_rq, rf);

--
2.17.1


>
> > for this.
> >
> > > Also, update_blocked_averages was supposed to be called in newidle_balance
> > > when the coming idle duration is expected to be long enough
> >
> > No, we do not want the schedule loop to take half a millisecond.
>
> keep in mind that you are scaling frequency, so everything takes time
> at the lowest frequency/capacity ...
>
> >
> > > > > IIUC, your real problem is that newidle_balance is running while a
> > > > > task is about to wake up on the cpu, and we don't abort quickly during
> > > > > this load_balance
> > > >
> > > > Yes.
> > > >
> > > > > so we could also try to abort earlier in case of newly idle load balance
> > > >
> > > > I think interrupts are disabled when the load balance runs, so there's no way
> > > > for, say, an audio interrupt to even run in order to wake up a task. How would
> > > > you know to abort the new idle load balance?
> > > >
> > > > Could you also elaborate on the drawbacks of the rate limiting patch we
> > > > posted? Do you see a side effect?
> > >
> > > Your patch just tries to hide your problem and does not solve the root cause.
> >
> > Agreed, the solution presented is a band-aid and not a proper fix. It
> > was just intended to illustrate the problem and start a discussion. I
> > should have marked it RFC for sure.
> >
> > thanks!
> >
> > - Joel