Re: [PATCH v2] rcu/segcblist: Add debug checks for segment lengths

From: Paul E. McKenney
Date: Wed Dec 02 2020 - 10:25:47 EST


On Wed, Dec 02, 2020 at 09:58:38AM -0500, Joel Fernandes wrote:
> On Tue, Dec 01, 2020 at 08:21:43PM -0800, Paul E. McKenney wrote:
> > On Tue, Dec 01, 2020 at 05:26:32PM -0500, Joel Fernandes wrote:
> > > On Thu, Nov 19, 2020 at 3:42 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > On Thu, Nov 19, 2020 at 12:16:15PM -0800, Paul E. McKenney wrote:
> > > > > On Thu, Nov 19, 2020 at 02:44:35PM -0500, Joel Fernandes wrote:
> > > > > > On Thu, Nov 19, 2020 at 2:22 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > > > > > > > > > > On Wed, Nov 18, 2020 at 11:15:41AM -0500, Joel Fernandes (Google) wrote:
> > > > > > > > > > > > After rcu_do_batch(), add a check for whether the seglen counts went to
> > > > > > > > > > > > zero if the list was indeed empty.
> > > > > > > > > > > >
> > > > > > > > > > > > Signed-off-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
> > > > > > > > > > >
> > > > > > > > > > > Queued for testing and further review, thank you!
> > > > > > > > > >
> > > > > > > > > > FYI, the second of the two checks triggered in all four one-hour runs of
> > > > > > > > > > TREE01, all four one-hour runs of TREE04, and one of the four one-hour
> > > > > > > > > > runs of TREE07. This one:
> > > > > > > > > >
> > > > > > > > > > WARN_ON_ONCE(count != 0 && rcu_segcblist_n_segment_cbs(&rdp->cblist) == 0);
> > > > > > > > > >
> > > > > > > > > > That is, there are callbacks in the list, but the sum of the segment
> > > > > > > > > > counts is nevertheless zero. The ->nocb_lock is held.
> > > > > > > > > >
> > > > > > > > > > Thoughts?
> > > > > > > > >
> > > > > > > > > FWIW, TREE01 reproduces it very quickly compared to the other two
> > > > > > > > > scenarios, on all four run, within five minutes.
> > > > > > > >
> > > > > > > > So far for TREE01, I traced it down to an rcu_barrier happening so it could
> > > > > > > > be related to some interaction with rcu_barrier() (Just a guess).
> > > > > > >
> > > > > > > Well, rcu_barrier() and srcu_barrier() are the only users of
> > > > > > > rcu_segcblist_entrain(), if that helps. Your modification to that
> > > > > > > function looks plausible to me, but the system's opinion always overrules
> > > > > > > mine. ;-)
> > > > > >
> > > > > > Right. Does anything the bypass code standout? That happens during
> > > > > > rcu_barrier() as well, and it messes with the lengths.
> > > > >
> > > > > In theory, rcu_barrier_func() flushes the bypass before doing the
> > > > > entrain, and does the rcu_segcblist_entrain() afterwards.
> > > > >
> > > > > Ah, and that is the issue. If ->cblist is empty and ->nocb_bypass
> > > > > is not, then ->cblist length will be nonzero, and none of the
> > > > > segments will be nonzero.
> > > > >
> > > > > So you need something like this for that second WARN, correct?
> > > > >
> > > > > WARN_ON_ONCE(!rcu_segcblist_empty(&rdp->cblist) &&
> > > > > rcu_segcblist_n_segment_cbs(&rdp->cblist) == 0);
> > >
> > > Just started to look into it again. If the &rdp->cblist is empty, that
> > > means the bypass list could not have been used (Since per comments on
> > > rcu_nocb_try_bypass() , the bypass list is in use only when the cblist
> > > is non-empty). So the cblist was non empty, then the segment counts
> > > should not sum to 0. So I don't think that explains it. Anyway, I
> > > will try the new version of your warning in case there is something
> > > about bypass lists that I'm missing.
> >
> > Good point. I really did see failures, though. Do they show up for
> > you?
>
> Yeah I do see failures. Once I change the warning as below, the failures go
> away though. So looks like indeed a segcblist can be empty when the bypass
> list has something in it? If you agree, could you change the warning to as
> below? The tests failing before all pass 1 hour rcutorture testing now
> (TREE01, TREE04 and TREE07).

I agree with the fix, which should not be too surprising. ;-)

But can you tell me how you can have a non-zero count while all the
segment counts sum to zero? Yes, you told me why nothing should be
placed in the bypass list while cblist is empty. Is that really the
only way this state can be entered?

Thanx, Paul

> ---8<-----------------------
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 91e35b521e51..3cd92b7df8ac 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2554,7 +2554,8 @@ static void rcu_do_batch(struct rcu_data *rdp)
> WARN_ON_ONCE(!IS_ENABLED(CONFIG_RCU_NOCB_CPU) &&
> count != 0 && rcu_segcblist_empty(&rdp->cblist));
> WARN_ON_ONCE(count == 0 && rcu_segcblist_n_segment_cbs(&rdp->cblist) != 0);
> - WARN_ON_ONCE(count != 0 && rcu_segcblist_n_segment_cbs(&rdp->cblist) == 0);
> + WARN_ON_ONCE(!rcu_segcblist_empty(&rdp->cblist) &&
> + rcu_segcblist_n_segment_cbs(&rdp->cblist) == 0);
>
> rcu_nocb_unlock_irqrestore(rdp, flags);
>
> --
> 2.29.2.454.gaff20da3a2-goog
>