Re: Internal vs. external barriers (was: Re: Interesting LKMM litmus test)

From: Paul E. McKenney
Date: Wed Jan 18 2023 - 19:02:25 EST


On Wed, Jan 18, 2023 at 03:54:47PM -0500, Alan Stern wrote:
> On Wed, Jan 18, 2023 at 12:06:01PM -0800, Paul E. McKenney wrote:
> > On Wed, Jan 18, 2023 at 11:50:24AM -0500, Alan Stern wrote:
> > Boqun mentioned off-list this morning that this is still the case,
> > and that each execution of srcu_read_lock() will return a unique value.
> > Assuming that I understood him correctly, anyway.
>
> That will no longer be true with the patch I posted yesterday. Every
> execution of srcu_read_lock() will return 0 (or whatever the initial
> value of the lock variable is).
>
> But with a small change to the .def file, each execution of
> srcu_read_unlock() can be made to increment the lock's value, and then
> the next srcu_read_lock() would naturally return the new value.

Different values might be good for debugging, but I am not sure that this
is worth the weight. Happy to go with whatever you decide on this one.

> > > > given that I have no idea how one would go about modeling down_read()
> > > > and up_read() in LKMM.
> > >
> > > It might make sense to work on that first, before trying to do
> > > srcu_down_read() and srcu_up_read().
> >
> > The thing is that it is easy to associate an srcu_down_read() with the
> > corresponding srcu_up_read(). With down() and up(), although in the
> > Linux kernel this might be represented by a data structure tracking
> > (say) an I/O request, LKMM is going to be hard pressed to figure that out.
>
> It would help (or at least, it would help _me_) if you gave a short
> explanation of how srcu_down_read() and srcu_up_read() are meant to
> work. With regular r/w semaphores, the initial lock value is 0, each
> down() operation decrements the value, each up() operation increments
> the value -- or vice versa if you don't like negative values -- and a
> write_lock() will wait until the value is >= 0. In that setting, it
> makes sense to say that a down() which changes the value from n to n-1
> matches the next up() which changes the value from n-1 to n.
>
> I presume that srcu semaphores do not work this way. Particularly since
> the down() operation returns a value which must be passed to the
> corresponding up() operation. So how _do_ they work?

There are pairs of per-CPU counters. One pair (->srcu_lock_count[])
counts the number of srcu_down_read() operations that took place on
that CPU and another pair (->srcu_unlock_count[]) counts the number
of srcu_up_read() operations that took place on that CPU. There is
an ->srcu_idx that selects which of the ->srcu_lock_count[] elements
should be incremented by srcu_down_read(). Of course, srcu_down_read()
returns the value of ->srcu_idx that it used so that the matching
srcu_up_read() will use that same index when incrementing its CPU's
->srcu_unlock_count[].
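
In (quite oversimplified) code, the reader side looks roughly like
this, with approximate field names, and with the kernel's srcu_data
indirection and NMI-safety details elided:

	int srcu_down_read(struct srcu_struct *ssp)
	{
		int idx = READ_ONCE(ssp->srcu_idx) & 0x1;

		this_cpu_inc(ssp->srcu_lock_count[idx]);
		smp_mb(); /* Order increment before critical section. */
		return idx; /* Passed to the matching srcu_up_read(). */
	}

	void srcu_up_read(struct srcu_struct *ssp, int idx)
	{
		smp_mb(); /* Order critical section before increment. */
		this_cpu_inc(ssp->srcu_unlock_count[idx]);
	}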

Grace periods go something like this:

1. Sum up the ->srcu_unlock_count[!ssp->srcu_idx] counters.

2. smp_mb().

3. Sum up the ->srcu_lock_count[!ssp->srcu_idx] counters.

4. If the sums are not equal, retry from #1.

5. smp_mb().

6. WRITE_ONCE(ssp->srcu_idx, !ssp->srcu_idx);

7. smp_mb().

8. Same loop as #1-4.

So similar to r/w semaphores, but with two separate distributed counts.
This means that the number of readers need not go to zero at any given
point in time, consistent with the need to wait only on old readers.
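
In the same oversimplified style, one pass of the grace-period
machinery might be sketched as follows, with the step numbers from
the list above in the comments:

	static bool srcu_readers_drained(struct srcu_struct *ssp, int idx)
	{
		unsigned long unlocks = 0, locks = 0;
		int cpu;

		for_each_possible_cpu(cpu)			/* #1. */
			unlocks += per_cpu(ssp->srcu_unlock_count[idx], cpu);
		smp_mb();					/* #2. */
		for_each_possible_cpu(cpu)			/* #3. */
			locks += per_cpu(ssp->srcu_lock_count[idx], cpu);
		return unlocks == locks;			/* #4. */
	}

	static void srcu_gp_sketch(struct srcu_struct *ssp)
	{
		while (!srcu_readers_drained(ssp, !READ_ONCE(ssp->srcu_idx)))
			continue;				/* #1-#4. */
		smp_mb();					/* #5. */
		WRITE_ONCE(ssp->srcu_idx, !ssp->srcu_idx);	/* #6. */
		smp_mb();					/* #7. */
		while (!srcu_readers_drained(ssp, !READ_ONCE(ssp->srcu_idx)))
			continue;				/* #8. */
	}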

> > > Hmmm. What happens if you write:
> > >
> > > r1 = srcu_down_read(x);
> > > r2 = srcu_down_read(x);
> > > srcu_up_read(x, r1);
> > > srcu_up_read(x, r2);
> > >
> > > ? I can't even tell what that would be _intended_ to do.
> >
> > Let's take it one line at a time:
> >
> > r1 = srcu_down_read(x);
> > // A
> > r2 = srcu_down_read(x);
> > // B
> > srcu_up_read(x, r1);
> > // C
> > srcu_up_read(x, r2);
> > // D
> >
> > An SRCU grace period that starts at A is permitted to complete at
> > C, difficult though it might be to actually make this happen in the
> > Linux kernel. It need wait only for pre-existing critical sections.
>
> So the down() returning r1 matches the up() receiving r1?

Yes.

> > But an SRCU grace period that starts at either B or C must wait for both
> > critical sections, that is until D.
>
> Implying that the down() returning r2 matches up() receiving r2?

Again, yes.

> And in general, an up() matches a down() iff they have the same values?

I would instead say that the match is determined by the fact that a
given srcu_up_read() receives what was returned from the corresponding
srcu_down_read(). So not a comparison of values, but rather something
like (data | rf)*.
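
For example, in a hypothetical litmus-test fragment like the one
below, the match follows the data link from the srcu_down_read() to
the store, the rf link from the store to the load, and the data link
from the load to the srcu_up_read(). (Only those executions in which
P1 actually reads from P0's store contain a matched pair, of course.)

	C srcu-down-handoff

	{}

	P0(struct srcu_struct *s, int *idx)
	{
		int r1;

		r1 = srcu_down_read(s);
		WRITE_ONCE(*idx, r1);
	}

	P1(struct srcu_struct *s, int *idx)
	{
		int r2;

		r2 = READ_ONCE(*idx);
		srcu_up_read(s, r2);
	}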

> And we can imagine that every down() returns a different value?

For purposes of herd7 emulation before your rework, yes. In real life,
and as an implementation detail, a given srcu_down_read() only ever
returns either 0 or 1. And from what I can see playing around, the
fact that any given srcu_down_read() only ever returns zero is not a
problem; the (data | rf)* still works fine. Thus far, anyway. ;-)

> How does this differ from srcu_read_lock() and srcu_read_unlock()? And
> how do the "up" and "down" parts figure into it? -- what is going up or
> down?

Functionally and from a performance/scalability viewpoint, they
are identical to srcu_read_lock() and srcu_read_unlock(). The only
difference is that srcu_down_read() and srcu_up_read() lack the lockdep
machinery that complains when a matching pair of srcu_read_lock() and
srcu_read_unlock() are used from different tasks.
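
For example, something like the following (hypothetical) request
structure lets the two halves run in different tasks:

	struct foo_request {
		int srcu_idx;	/* Cookie from srcu_down_read(). */
		/* ... */
	};

	/* Submission, in one task. */
	void foo_submit(struct srcu_struct *ssp, struct foo_request *req)
	{
		req->srcu_idx = srcu_down_read(ssp);
		foo_enqueue(req);	/* Hypothetical handoff. */
	}

	/* Completion, possibly in quite another task. */
	void foo_complete(struct srcu_struct *ssp, struct foo_request *req)
	{
		srcu_up_read(ssp, req->srcu_idx);
	}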

Within the implementation, nothing ever goes down, it is all
this_cpu_inc(). The "down" and "up" are by analogy to down() and up(),
where "down()" says acquire some rights to a resource and "up()" says
release those rights.

Wait, I can make "down" work.

A call to srcu_down_read() reduces the quantity computed by summing the
unlocks then subtracting the sum of the locks. A call to srcu_up_read()
increases that same quantity. ;-)

Thanx, Paul