Re: [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle statemachine

From: Paul E. McKenney
Date: Fri Jul 19 2013 - 01:07:12 EST


On Fri, Jul 19, 2013 at 04:12:08AM +0200, Frederic Weisbecker wrote:
> On Thu, Jul 18, 2013 at 05:24:08PM -0700, Paul E. McKenney wrote:
> > On Fri, Jul 19, 2013 at 12:46:21AM +0200, Frederic Weisbecker wrote:
> > > On Thu, Jul 18, 2013 at 09:47:49AM -0700, Paul E. McKenney wrote:
> > > > 1.  Some CPU coming out of idle:
> > > >
> > > >     o   rcu_sysidle_exit():
> > > >
> > > >             smp_mb__before_atomic_inc();
> > > >             atomic_inc(&rdtp->dynticks_idle);
> > > >             smp_mb__after_atomic_inc(); /* A */
> > > >
> > > >     o   rcu_sysidle_force_exit():
> > > >
> > > >             oldstate = ACCESS_ONCE(full_sysidle_state);
> > > >
> > > > 2.  RCU GP kthread:
> > > >
> > > >     o   rcu_sysidle():
> > > >
> > > >             cmpxchg(&full_sysidle_state, RCU_SYSIDLE_SHORT, RCU_SYSIDLE_LONG);
> > > >             /* B */
> > > >
> > > >     o   rcu_sysidle_check_cpu():
> > > >
> > > >             cur = atomic_read(&rdtp->dynticks_idle);
> > > > Memory barrier A pairs with memory barrier B, so that if #1's load
> > > > from full_sysidle_state sees RCU_SYSIDLE_SHORT, we know that #1's
> > > > atomic_inc() must be visible to #2's atomic_read(). This will cause #2
> > > > to recognize that the CPU came out of idle, which will in turn cause it
> > > > to invoke rcu_sysidle_cancel() instead of rcu_sysidle(), resulting in
> > > > full_sysidle_state being set to RCU_SYSIDLE_NOT.
> > >
> > > Ok I get it for that direction.
> > > Now imagine CPU 0 is the RCU GP kthread (#2) and CPU 1 is idle and stays
> > > so.
> > >
> > > CPU 0 then makes its rounds and sees that all CPUs are idle, until it
> > > finally sets RCU_SYSIDLE_SHORT_FULL and goes to sleep.
> > >
> > > Then CPU 1 wakes up. It really has to see a value above RCU_SYSIDLE_SHORT
> > > otherwise it won't do the cmpxchg and see the FULL_NOTED that makes it send
> > > the IPI.
> > >
> > > What provides the guarantee that CPU 1 sees a value above RCU_SYSIDLE_SHORT?
> > > Not at the cmpxchg, but when it first reads the state with ACCESS_ONCE.
> >
> > The trick is that CPU 0 will have scanned, moved to RCU_SYSIDLE_SHORT,
> > scanned, moved to RCU_SYSIDLE_LONG, then scanned again before moving
> > to RCU_SYSIDLE_FULL. Given CPU 1 has been idle all this time, CPU 0
> > will have read its ->dynticks_idle counter on each scan and seen it
> > having an even value. When CPU 1 comes out of idle, it will atomically
> > increment its ->dynticks_idle, which will happen after CPU 0's read of
> > ->dynticks_idle during its last scan.
> >
> > Because of the memory-barrier pairing above, this means that CPU
> > 1's read from full_sysidle_state must follow the cmpxchg() that
> > set full_sysidle_state to RCU_SYSIDLE_LONG (though not necessarily
> > the two later cmpxchg()s that set it to RCU_SYSIDLE_FULL and
> > RCU_SYSIDLE_FULL_NOTED). But because RCU_SYSIDLE_LONG is greater than
> > RCU_SYSIDLE_SHORT, CPU 1 will take action to end the idle period.
>
> Let's summarize the last sequence. The following happens, ordered by time:
>
>        CPU 0                                    CPU 1
>
>        cmpxchg(&full_sysidle_state,
>                RCU_SYSIDLE_SHORT,
>                RCU_SYSIDLE_LONG);
>
>        smp_mb() //cmpxchg
>
>        atomic_read(rdtp(1)->dynticks_idle)
>
>        //CPU 0 goes to sleep
>                                                 //CPU 1 wakes up
>                                                 atomic_inc(rdtp(1)->dynticks_idle)
>
>                                                 smp_mb()
>
>                                                 ACCESS_ONCE(full_sysidle_state)
>
>
> Are you suggesting that because CPU 1 executes its atomic_inc() _after_ (in terms
> of absolute time) CPU 0's atomic_read(), the ordering established on both sides guarantees
> that the value CPU 1 reads is the one from the cmpxchg that precedes the atomic_read(),
> or the FULL or FULL_NOTED values that are set later?
>
> If so that's a big lesson for me.

It is not absolute time that matters. Instead, it is the fact that
CPU 0, when reading from ->dynticks_idle, read the old value, the one
from before CPU 1's atomic_inc(). Therefore, anything CPU 0 did before
the memory barrier preceding its read must come before anything CPU 1
did after the memory barrier following the atomic_inc(). For this to
work, both CPUs must access the same variable, in this case
->dynticks_idle.
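
For what it is worth, the pairing can be sketched as a userspace litmus
test. This is only an illustration using C11 atomics and pthreads, not
the kernel code: the names state, dynticks_idle, SYSIDLE_SHORT,
SYSIDLE_LONG, cpu0(), cpu1(), and count_forbidden_outcomes() are
hypothetical stand-ins, and seq_cst fences are assumed to play the role
of the kernel's full memory barriers. The outcome that the pairing
rules out is CPU 0 reading the old counter value while CPU 1 still
reads the SHORT state.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for RCU_SYSIDLE_SHORT and RCU_SYSIDLE_LONG. */
enum { SYSIDLE_SHORT = 1, SYSIDLE_LONG = 2 };

static atomic_int state;         /* stands in for full_sysidle_state */
static atomic_int dynticks_idle; /* stands in for rdtp->dynticks_idle */
static int cpu0_saw_counter;     /* what CPU 0's scan read */
static int cpu1_saw_state;       /* what CPU 1's state sample read */

/* CPU 0 (GP kthread): advance the state, full barrier B, then scan. */
static void *cpu0(void *unused)
{
	int expected = SYSIDLE_SHORT;

	(void)unused;
	atomic_compare_exchange_strong(&state, &expected, SYSIDLE_LONG);
	atomic_thread_fence(memory_order_seq_cst);	/* B */
	cpu0_saw_counter = atomic_load_explicit(&dynticks_idle,
						memory_order_relaxed);
	return NULL;
}

/* CPU 1: leave idle, full barrier A, then sample the state. */
static void *cpu1(void *unused)
{
	(void)unused;
	atomic_fetch_add_explicit(&dynticks_idle, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* A */
	cpu1_saw_state = atomic_load_explicit(&state, memory_order_relaxed);
	return NULL;
}

/* Run the two-thread test repeatedly, counting forbidden outcomes. */
int count_forbidden_outcomes(int iterations)
{
	int forbidden = 0;

	for (int i = 0; i < iterations; i++) {
		pthread_t t0, t1;
		int rc;

		atomic_store(&state, SYSIDLE_SHORT);
		atomic_store(&dynticks_idle, 0);	/* even: CPU 1 idle */

		rc = pthread_create(&t0, NULL, cpu0, NULL);
		assert(rc == 0);
		rc = pthread_create(&t1, NULL, cpu1, NULL);
		assert(rc == 0);
		pthread_join(t0, NULL);
		pthread_join(t1, NULL);

		/* CPU 0 saw the old counter, yet CPU 1 missed the new state. */
		if (cpu0_saw_counter == 0 && cpu1_saw_state == SYSIDLE_SHORT)
			forbidden++;
	}
	return forbidden;
}
```

Calling count_forbidden_outcomes() with a few thousand iterations should
return zero. Note that nothing in the test consults a clock: all that
matters is which value each load returned. Removing either fence would
make the forbidden outcome permitted, though a stress test like this
might or might not actually observe it on any given machine.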

Or, if you must think in terms of time, you need a separate independent
timeline for each variable, with no direct mapping from one timeline to
another, except for the mappings that result from memory-barrier
interactions.

Thanx, Paul
