Re: [RFC] doc: rcu: remove note on smp_mb during synchronize_rcu

From: Joel Fernandes
Date: Fri Nov 02 2018 - 18:14:36 EST


On Fri, Nov 02, 2018 at 01:00:00PM -0700, Paul E. McKenney wrote:
[..]
> > I think it would make sense also to combine it with other memory-ordering
> > topics like the memory model and rseq/cpu-opv things that Mathieu was doing
> > (if it makes sense to combine). But yes, I am definitely interested in an
> > RCU-implementation BoF session.
>
> There is an LKMM kernel summit track presentation. I believe that
> Mathieu's rseq/cpu-opv would be a good one as well, but Mathieu needs
> to lead this up and it should be a separate BoF. Please do feel free
> to reach out to him. I am sure that he would be particularly interested
> in potential uses of rseq and especially cpu-opv.

Cool! Looking forward to the LKMM talk, and I'll keep in mind to reach out to
Mathieu about rseq use cases.

> > > > > > Also about GP memory ordering and RCU-tree-locking, I think you mentioned to
> > > > > > me that the RCU reader-sections are virtually extended both forward and
> > > > > > backward and wherever it ends, those paths do heavy-weight synchronization
> > > > > > that should be sufficient to prevent memory ordering issues (such as those
> > > > > > you mentioned in the Requirements document). That is exactly why we don't
> > > > > > need explicit barriers during rcu_read_unlock. If I recall I asked you why
> > > > > > those are not needed. So that answer made sense, but then now on going
> > > > > > through the 'Memory Ordering' document, I see that you mentioned there is
> > > > > > reliance on the locking. Is that reliance on locking necessary to maintain
> > > > > > ordering then?
> > > > >
> > > > > There is a "network" of locking augmented by smp_mb__after_unlock_lock()
> > > > > that implements the all-to-all memory ordering mentioned above. But it
> > > > > also needs to handle all the possible complete()/wait_for_completion()
> > > > > races, even those assisted by hypervisor vCPU preemption.
> > > >
> > > > I see, so it sounds like the lock network is just a partial solution. For
> > > > some reason I thought before that complete() was even called on the CPU
> > > > executing the callback, all the CPUs would have acquired and released a lock
> > > > in the "lock network" atleast once thus ensuring the ordering (due to the
> > > > fact that the quiescent state reporting has to travel up the tree starting
> > > > from the leaves), but I think that's not necessarily true so I see your point
> > > > now.
> > >
> > > There is indeed a lock that is unconditionally acquired and released by
> > > wait_for_completion(), but it lacks the smp_mb__after_unlock_lock() that
> > > is required to get full-up any-to-any ordering. And unfortunate timing
> > > (as well as spurious wakeups) allow the interaction to have only normal
> > > lock-release/acquire ordering, which does not suffice in all cases.
> > >
> > > SRCU and expedited RCU grace periods handle this correctly. Only the
> > > normal grace periods are missing the needed barrier. The probability of
> > > failure is extremely low in the common case, which involves all sorts
> > > of synchronization on the wakeup path. It would be quite strange (but
> > > not impossible) for the wait_for_completion() exit path to -not- do
> > > a full wakeup. Plus the bug requires a reader before the grace period
> > > to do a store to some location that post-grace-period code loads from.
> > > Which is a very rare use case.
> > >
> > > But it still should be fixed. ;-)
> > >
> > > > Did you feel this will violate condition 1. or condition 2. in "Memory-Barrier
> > > > Guarantees"? Or both?
> > > > https://www.kernel.org/doc/Documentation/RCU/Design/Requirements/Requirements.html#Memory-Barrier%20Guarantees
> > >
> > > Condition 1. There might be some strange combination of events that
> > > could also cause it to also violate condition 2, but I am not immediately
> > > seeing it.
> >
> > In the previous paragraph, you mentioned the bug "requires a reader before
> > the GP to do a store". However, condition 1 is really different - it is a
> > reader holding a reference to a pointer that is used *after* the
> > synchronize_rcu returns. So that reader's load of the pointer should have
> > completed by the time GP ends, otherwise the reader can look at kfree'd data.
> > That's different right?
>
> More specifically, the fix prevents a prior reader's -store- within its
> critical section to be seen as happening after a load that follows the
> end of the grace period. So I stand by Condition 1. ;-)
> And again, a store within an RCU read-side critical section is a bit
> on the strange side, but this sort of thing is perfectly legal and
> is used, albeit rather rarely.

Cool :) I had never thought about condition 1 this way, but it's good to know
that's possible :)
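
Just to make sure I am picturing it right, here is a minimal sketch of the
store-in-a-reader case I now have in mind (reader_flag, reader_side() and
updater_side() are made-up names for illustration, nothing from the tree):

#include <linux/rcupdate.h>
#include <linux/bug.h>

/* Hypothetical illustration of Condition 1: a store inside a reader
 * that began before the grace period must be visible to loads that
 * run after synchronize_rcu() returns. */
static int reader_flag;

static void reader_side(void)          /* say, on CPU 0 */
{
	rcu_read_lock();
	WRITE_ONCE(reader_flag, 1);    /* a *store* inside the reader */
	rcu_read_unlock();
}

static void updater_side(void)         /* say, on CPU 1 */
{
	synchronize_rcu();
	/* Assuming reader_side()'s critical section began before the
	 * grace period, this load must observe the value 1. */
	WARN_ON_ONCE(READ_ONCE(reader_flag) != 1);
}

If that matches the scenario you had in mind, then I think my mental model of
condition 1 is finally right.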

> > For condition 2, I analyzed it below, let me know what you think:
> >
> > > Thanx, Paul
> > >
> > > ------------------------------------------------------------------------
> > >
> > > commit bf3c11b7b9789283f993d9beb80caaabc4403916
> > > Author: Paul E. McKenney <paulmck@xxxxxxxxxxxxx>
> > > Date: Thu Nov 1 09:05:02 2018 -0700
> > >
> > > rcu: Add full memory barrier in __wait_rcu_gp()
> > >
> > > RCU grace periods have extremely strong any-to-any ordering
> > > requirements that are met by placing full barriers in various places
> > > in the grace-period computation. However, normal grace period requests
> > > might be subjected to a "fly-by" wakeup in which the requesting process
> > > doesn't actually sleep and in which the corresponding CPU is not yet
> > > aware that the grace period has ended. In this case, loads in the code
> > > immediately following the synchronize_rcu() call might potentially see
> > > values before stores preceding the grace period on other CPUs.
> > >
> > > This is an unusual use case, because RCU readers normally read. However,
> > > they can do writes, and if they do, we need post-grace-period reads to
> > > see those writes.
> > >
> > > This commit therefore adds an smp_mb() to the end of __wait_rcu_gp().
> > >
> > > Many thanks to Joel Fernandes for the series of questions leading to me
> > > realizing that this bug exists!
> > >
> > > Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxx>
> > >
> > > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> > > index 1971869c4072..74020b558216 100644
> > > --- a/kernel/rcu/update.c
> > > +++ b/kernel/rcu/update.c
> > > @@ -360,6 +360,7 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
> > > wait_for_completion(&rs_array[i].completion);
> > > destroy_rcu_head_on_stack(&rs_array[i].head);
> > > }
> > > + smp_mb(); /* Provide ordering in case of fly-by wakeup. */
> > > }
> > > EXPORT_SYMBOL_GPL(__wait_rcu_gp);
> > >
> >
> > The fix looks fine to me. Thanks.
> >
> > If I understand correctly the wait_for_completion() is an ACQUIRE operation,
> > and the complete() is a RELEASE operation aka the "MP pattern". The
> > ACQUIRE/RELEASE semantics allow any writes that happened before the ACQUIRE
> > to get ordered after it. So that would actually imply it is not strong enough
> > for ordering purposes during a "fly-by" wake up scenario and would be a
> > violation of CONDITION 2, I think (not only condition 1 as you said). This
> > is because future readers may accidentally see the writes that happened
> > *before* the synchronize_rcu which is CONDITION 2 in the requirements:
> > https://goo.gl/8mrDHN (I had to shortlink it ;))
>
> I do appreciate the shorter link. ;-)
>
> A write happening before the grace period is ordered by the grace period's
> network of strong barriers, so the fix does not matter in that case.

I was talking about the acquire/release pair in the spin_lock()/spin_unlock()
calls inside wait_for_completion(), not about the grace-period network of rnp
locks. Does that make sense?

I thought that during a "fly-by" wakeup, that network of strong barriers
doesn't really come into play, and that that's exactly the problematic
scenario. Did I miss something? My concern was specifically with the
acquire/release pair in wait_for_completion() during that fly-by case.
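
For concreteness, the pair I have in mind is roughly the following (a heavily
simplified sketch of the completion fast path, not the real
kernel/sched/completion.c code, and wait_for_completion_sketch() is just a
made-up name):

#include <linux/completion.h>
#include <linux/spinlock.h>

/* Heavily simplified sketch (not the real completion code) of the
 * wait_for_completion() fast path during a "fly-by" wakeup, i.e.
 * complete() already ran, so the waiter never actually sleeps. */
static void wait_for_completion_sketch(struct completion *x)
{
	spin_lock_irq(&x->wait.lock);          /* ACQUIRE */
	if (x->done)                           /* already completed */
		x->done--;                     /* fly-by: consume it, no sleep */
	spin_unlock_irq(&x->wait.lock);        /* RELEASE */
	/* Only plain lock acquire/release ordering on this path, which
	 * is why the patch adds smp_mb() in __wait_rcu_gp() afterward. */
}

If the ordering during a fly-by wakeup has to come from that pair alone, then
the added smp_mb() makes sense to me.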

> Also, the exact end of the grace period is irrelevant for Condition 2,
> it is instead the beginning of the grace period compared to the beginning
> of later RCU read-side critical sections.
>
> Not saying that Condition 2 cannot somehow happen without the memory
> barrier, just saying that it will take quite a bit more creativity to
> find a relevant scenario.
>
> Please see below for the updated patch, containing only the typo fix.
> Please let me know if I messed anything up.

Looks good to me, thanks!

- Joel