Re: [PATCH] sched/fair: Fix nohz.next_balance update

From: Peng Liu
Date: Tue May 05 2020 - 09:41:04 EST


On Mon, May 04, 2020 at 05:17:11PM +0200, Vincent Guittot wrote:
> On Sun, 3 May 2020 at 10:34, Peng Liu <iwtbavbm@xxxxxxxxx> wrote:
> >
> > commit c5afb6a87f23 ("sched/fair: Fix nohz.next_balance update")
> > During idle load balance, this_cpu(ilb) do load balance for the other
> > idle CPUs, also gather the earliest (nohz.)next_balance.
> >
> > Since commit:
> > 'b7031a02ec75 ("sched/fair: Add NOHZ_STATS_KICK")'
> >
> > We update nohz.next_balance like this:
> >
> > _nohz_idle_balance() {
> > for_each_cpu(nohz.idle_cpus_mask) {
> > rebalance_domains() {
> > update nohz.next_balance <-- compare and update
> > }
> > }
> > rebalance_domains(this_cpu) {
> > update nohz.next_balance <-- compare and update
> > }
> > update nohz.next_balance <-- unconditionally update
> > }
> >
> > For instance, nohz.idle_cpus_mask spans {cpu2,3,5,8}, and this_cpu is
> > cpu5. After the above loop we could gather the earliest *next_balance*
> > among {cpu2,3,8}, then rebalance_domains(this_cpu) update
> > nohz.next_balance with this_rq->next_balance, but finally overwrite
> > nohz.next_balance with the earliest *next_balance* among {cpu2,3,8},
> > we may end up with not getting the earliest next_balance.
> >
> > Since we can gather all the updated rq->next_balance, including this_cpu,
> > in _nohz_idle_balance(), it's safe to remove the extra lines in
> > rebalance_domains() which are originally intended for this_cpu. And
> > finally the updating only happen in _nohz_idle_balance().
>
> I'm not sure that's always true. Nothing prevents nohz_idle_balance()
> to return false . Then run_rebalance_domains() calls
> rebalance_domains(this_rq ,SCHED_IDLE) outside _nohz_idle_balance().
> In this case we must keep the code in rebalance_domains().
>
> For example when the tick is not stopped when entering idle. Or when
> need_resched() returns true.
>
> So instead of removing the code from rebalance_domains, you should
> move the one in _nohz_idle_balance() to make sure that the "if
> (likely(update_next_balance)) ..." is called before calling
> rebalance_domains for the local cpu
>

Yes, you're right. When need_resched() returns true, things become out
of expectation. We haven't really got the earliest next_balance, abort
the update immediately and let the successor to help. Doubtless this
will incur some overhead due to the repeating work.

About the "tick is not stopped when entering idle" case, defer the
update to nohz_balance_enter_idle() would be a choice too.

Of course, only update nohz.next_balance in rebalance_domains() is the
simpliest way, but as @Valentin put, too many write to it may incur
unnecessary overhead. If we can gather the earliest next_balance in
advance, then a single write is considered to be better.

By the way, remove the redundant check in nohz_idle_balance().

FWIW, how about the below?
***********************************************
* Below code is !!!ENTIRELY UNTESTED!!!, just *
* a draft to see whehter it's sensible! *
***********************************************
-------------------<-----------------------
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f323b85b6d..a7d63ea706ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9943,22 +9943,8 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
* When the cpu is attached to null domain for ex, it will not be
* updated.
*/
- if (likely(update_next_balance)) {
+ if (likely(update_next_balance))
rq->next_balance = next_balance;
-
-#ifdef CONFIG_NO_HZ_COMMON
- /*
- * If this CPU has been elected to perform the nohz idle
- * balance. Other idle CPUs have already rebalanced with
- * nohz_idle_balance() and nohz.next_balance has been
- * updated accordingly. This CPU is now running the idle load
- * balance for itself and we need to update the
- * nohz.next_balance accordingly.
- */
- if ((idle == CPU_IDLE) && time_after(nohz.next_balance, rq->next_balance))
- nohz.next_balance = rq->next_balance;
-#endif
- }
}

static inline int on_null_domain(struct rq *rq)
@@ -10218,6 +10204,9 @@ void nohz_balance_enter_idle(int cpu)

rq->nohz_tick_stopped = 1;

+ if (time_after(nohz.next_balance, rq->next_balance))
+ nohz.next_balance = rq->next_balance;
+
cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
atomic_inc(&nohz.nr_cpus);

@@ -10287,6 +10276,7 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
*/
if (need_resched()) {
has_blocked_load = true;
+ update_next_balance = 0;
goto abort;
}

@@ -10321,9 +10311,15 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
has_blocked_load |= this_rq->has_blocked_load;
}

- if (flags & NOHZ_BALANCE_KICK)
+ if (flags & NOHZ_BALANCE_KICK) {
rebalance_domains(this_rq, CPU_IDLE);

+ if (time_after(next_balance, this_rq->next_balance)) {
+ next_balance = this_rq->next_balance;
+ update_next_balance = 1;
+ }
+ }
+
WRITE_ONCE(nohz.next_blocked,
now + msecs_to_jiffies(LOAD_AVG_PERIOD));

@@ -10354,9 +10350,7 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
{
int this_cpu = this_rq->cpu;
unsigned int flags;
-
- if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
- return false;
+ bool done;

if (idle != CPU_IDLE) {
atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
@@ -10368,9 +10362,16 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
if (!(flags & NOHZ_KICK_MASK))
return false;

- _nohz_idle_balance(this_rq, flags, idle);
+ /*
+ * If idle load balance terinated due to this CPU become busy,
+ * pretend it has successfully pulled some loads, and abort
+ * the following load balance.
+ */
+ done = _nohz_idle_balance(this_rq, flags, idle);
+ if (done == false && need_resched())
+ return true;

- return true;
+ return done;
}

static void nohz_newidle_balance(struct rq *this_rq)