Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()

From: Paul E. McKenney
Date: Mon May 21 2018 - 10:35:39 EST


On Mon, May 21, 2018 at 03:50:10PM +0200, Roman Penyaev wrote:
> On Sun, May 20, 2018 at 2:43 AM, Paul E. McKenney
> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> > On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote:
> >> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney
> >> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> >> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote:
> >> >> Function is going to be used in transport over RDMA module
> >> >> in subsequent patches.
> >> >>
> >> >> Function returns next element in round-robin fashion,
> >> >> i.e. head will be skipped. NULL will be returned if list
> >> >> is observed as empty.
> >> >>
> >> >> Signed-off-by: Roman Pen <roman.penyaev@xxxxxxxxxxxxxxxx>
> >> >> Cc: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> >> >> Cc: linux-kernel@xxxxxxxxxxxxxxx
> >> >> ---
> >> >> include/linux/rculist.h | 19 +++++++++++++++++++
> >> >> 1 file changed, 19 insertions(+)
> >> >>
> >> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> >> >> index 127f534fec94..b0840d5ab25a 100644
> >> >> --- a/include/linux/rculist.h
> >> >> +++ b/include/linux/rculist.h
> >> >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
> >> >> })
> >> >>
> >> >> /**
> >> >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
> >> >> + * @head: the head for the list.
> >> >> + * @ptr: the list head to take the next element from.
> >> >> + * @type: the type of the struct this is embedded in.
> >> >> + * @memb: the name of the list_head within the struct.
> >> >> + *
> >> >> + * Next element returned in round-robin fashion, i.e. head will be skipped,
> >> >> + * but if list is observed as empty, NULL will be returned.
> >> >> + *
> >> >> + * This primitive may safely run concurrently with the _rcu list-mutation
> >> >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
> >> >
> >> > Of course, all the set of list_next_or_null_rr_rcu() invocations that
> >> > are round-robining a given list must all be under the same RCU read-side
> >> > critical section. For example, the following will break badly:
> >> >
> >> > struct foo *take_rr_step(struct list_head *head, struct foo *ptr)
> >> > {
> >> > struct foo *ret;
> >> >
> >> > rcu_read_lock();
> >> > ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist);
> >> > rcu_read_unlock(); /* BUG */
> >> > return ret;
> >> > }
> >> >
> >> > You need a big fat comment stating this, at the very least. The resulting
> >> > bug can be very hard to trigger and even harder to debug.
> >> >
> >> > And yes, I know that the same restriction applies to list_next_rcu()
> >> > and friends. The difference is that if you try to invoke those in an
> >> > infinite loop, you will be rapped on the knuckles as soon as you hit
> >> > the list header. Without that knuckle-rapping, RCU CPU stall warnings
> >> > might tempt people to do something broken like take_rr_step() above.
> >>
> >> Hi Paul,
> >>
> >> I need -rr behaviour for doing IO load-balancing when I choose next RDMA
> >> connection from the list in order to send a request, i.e. my code is
> >> something like the following:
> >>
> >> static struct conn *get_and_set_next_conn(void)
> >> {
> >> struct conn *conn;
> >>
> >> conn = rcu_dereferece(rcu_conn);
> >> if (unlikely(!conn))
> >> return conn;
> >
> > Wait. Don't you need to restart from the beginning of the list in
> > this case? Or does the list never have anything added to it and is
> > rcu_conn initially the first element in the list?
>
> Hi Paul,
>
> No, I continue from the pointer, which I assigned on the previous IO
> in order to send IO fairly and keep load balanced.
>
> Initially @rcu_conn points to the first element, but elements can
> be deleted from the list and list can become empty.
>
> The deletion code is below.
>
> >
> >> conn = list_next_or_null_rr_rcu(&conn_list,
> >> &conn->entry,
> >> typeof(*conn),
> >> entry);
> >> rcu_assign_pointer(rcu_conn, conn);
> >
> > Linus is correct to doubt this code. You assign a pointer to the current
> > element to rcu_conn, which is presumably a per-CPU or global variable.
> > So far, so good ...
>
> I use per-CPU, in the first example I did not show that not to overcomplicate
> the code.
>
> >
> >> return conn;
> >> }
> >>
> >> rcu_read_lock();
> >> conn = get_and_set_next_conn();
> >> if (unlikely(!conn)) {
> >> /* ... */
> >> }
> >> err = rdma_io(conn, request);
> >> rcu_read_unlock();
> >
> > ... except that some other CPU might well remove the entry referenced by
> > rcu_conn at this point. It would have to wait for a grace period (e.g.,
> > synchronize_rcu()), but the current CPU has exited its RCU read-side
> > critical section, and therefore is not blocking the grace period.
> > Therefore, by the time get_and_set_next_conn() picks up rcu_conn, it
> > might well be referencing the freelist, or, even worse, some other type
> > of structure.
> >
> > What is your code doing to prevent this from happening? (There are ways,
> > but I want to know what you were doing in this case.)
>
> Probably I should have shown the way of removal at the very beginning,
> my fault. So deletion looks as the following (a bit changed and
> simplified for the sake of clearness):

Thank you! Let's see...

> static void remove_connection(conn)
> {
> bool need_to_wait = false;
> int cpu;
>
> /* Do not let RCU list add/delete happen in parallel */
> mutex_lock(&conn_lock);
>
> list_del_rcu(&conn->entry);
>
> /* Make sure everybody observes element removal */
> synchronize_rcu();

At this point, any reader who saw the element in the list is done, as you
comment in fact says. But there might be a pointer to that element in the
per-CPU variables, however, from this point forward it cannot be the case
that one of the per-CPU variables gets set to the newly deleted element.
Which is your next block of code...

> /*
> * At this point nobody sees @conn in the list, but
> still we have
> * dangling pointer @rcu_conn which _can_ point to @conn. Since
> * nobody can observe @conn in the list, we guarantee
> that IO path
> * will not assign @conn to @rcu_conn, i.e. @rcu_conn
> can be equal
> * to @conn, but can never again become @conn.
> */
>
> /*
> * Get @next connection from current @conn which is going to be
> * removed.
> */
> next = list_next_or_null_rr_rcu(&conn_list, &conn->entry,
> typeof(*next), entry);
>
> /*
> * Here @rcu_conn can be changed by reader side, so use @cmpxchg
> * in order to keep fairness in load-balancing and do not touch
> * the pointer which can be already changed by the IO path.
> *
> * Current path can be faster than IO path and the
> following race
> * exists:
> *
> * CPU0 CPU1
> * ---- ----
> * conn = rcu_dereferece(rcu_conn);
> * next = list_next_or_null_rr_rcu(conn)
> *
> * conn ==
> cmpxchg(rcu_conn, conn, next);
> * synchronize_rcu();
> *
> * rcu_assign_pointer(rcu_conn, next);
> * ^^^^^^^^^^^^^^^^^^
> *
> * Here @rcu_conn is already equal to @next (done by
> @cmpxchg),
> * so assignment to the same pointer is harmless.
> *
> */
> for_each_possible_cpu(cpu) {
> struct conn **rcu_conn;
>
> rcu_conn = per_cpu_ptr(pcpu_rcu_conn, cpu);
> if (*rcu_conn != conn)
> /*
> * This @cpu will never again pick up @conn,
> * so it is safe just to choose next CPU.
> */
> continue;

... Someone else might have picked up rcu_conn at this point...

> if (conn == cmpxchg(rcu_conn, conn, next))
> /*
> * @rcu_conn was successfully replaced
> with @next,
> * that means that someone can also hold a @conn
> * and dereferencing it, so wait for a
> grace period
> * is required.
> */
> need_to_wait = true;

... But if there was any possibility of that, need_to_wait is true, and it
still cannot be the case that a reader finds the newly deleted element
in the list, so they cannot find that element, so the pcpu_rcu_conn
variables cannot be set to it.

> }
> if (need_to_wait)
> synchronize_rcu();

And at this point, the reader that might have picked up rcu_conn
just before the cmpxchg must have completed. (Good show, by the way!
Many people miss the fact that they need this second synchronize_rcu().)

Hmmm... What happens if this was the last element in the list, and
the relevant pcpu_rcu_conn variable references that newly removed
element? Taking a look at list_next_or_null_rcu() and thus at
list_next_or_null_rcu(), and it does appear that you get NULL in that
case, as is right and good.

> mutex_unlock(&conn_lock);
>
> kfree(conn);
> }
>
>
> >
> >> i.e. usage of the @next pointer is under an RCU critical section.
> >>
> >> > Is it possible to instead do some sort of list_for_each_entry_rcu()-like
> >> > macro that makes it more obvious that the whole thing need to be under
> >> > a single RCU read-side critical section? Such a macro would of course be
> >> > an infinite loop if the list never went empty, so presumably there would
> >> > be a break or return statement in there somewhere.
> >>
> >> The difference is that I do not need a loop, I take the @next conn pointer,
> >> save it for the following IO request and do IO for current IO request.
> >>
> >> It seems list_for_each_entry_rcu()-like with immediate "break" in the body
> >> of the loop does not look nice, I personally do not like it, i.e.:
> >>
> >>
> >> static struct conn *get_and_set_next_conn(void)
> >> {
> >> struct conn *conn;
> >>
> >> conn = rcu_dereferece(rcu_conn);
> >> if (unlikely(!conn))
> >> return conn;
> >> list_for_each_entry_rr_rcu(conn, &conn_list,
> >> entry) {
> >> break;
> >> }
> >> rcu_assign_pointer(rcu_conn, conn);
> >> return conn;
> >> }
> >>
> >>
> >> or maybe I did not fully get your idea?
> >
> > That would not help at all because you are still leaking the pointer out
> > of the RCU read-side critical section. That is completely and utterly
> > broken unless you are somehow cleaning up rcu_conn when you remove
> > the element. And getting that cleanup right is -extremely- tricky.
> > Unless you have some sort of proof of correctness, you will get a NACK
> > from me.
>
> I understand all the consequences of the leaking pointer, and of course
> wrapped loop with RCU lock/unlock is simpler, but in order to keep
> load-balancing and IO fairness avoiding any locks on IO path I've come
> up with these RCU tricks and list_next_or_null_rr_rcu() macro.

At first glance, it appears that you have handled this correctly. But I
can make mistakes just as easily as the next guy, so what have you done
to validate your algorithm?

> > More like this:
> >
> > list_for_each_entry_rr_rcu(conn, &conn_list, entry) {
> > do_something_with(conn);
> > if (done_for_now())
> > break;
> > }
> >
> >> >> + */
> >> >> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \
> >> >> +({ \
> >> >> + list_next_or_null_rcu(head, ptr, type, memb) ?: \
> >> >> + list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \
> >> >
> >> > Are there any uses for this outside of RDMA? If not, I am with Linus.
> >> > Define this within RDMA, where a smaller number of people can more
> >> > easily be kept aware of the restrictions on use. If it turns out to be
> >> > more generally useful, we can take a look at exactly what makes sense
> >> > more globally.
> >>
> >> The only one list_for_each_entry_rcu()-like macro I am aware of is used in
> >> block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr():
> >>
> >> https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370
> >>
> >> Does it make sense to implement generic list_next_or_null_rr_rcu() reusing
> >> my list_next_or_null_rr_rcu() variant?
> >
> > Let's start with the basics: It absolutely does not make sense to leak
> > pointers across rcu_read_unlock() unless you have arranged something else
> > to protect the pointed-to data in the meantime. There are a number of ways
> > of implementing this protection. Again, what protection are you using?
> >
> > Your code at the above URL looks plausible to me at first glance: You
> > do rcu_read_lock(), a loop with list_for_each_entry_rcu_rr(), then
> > rcu_read_unlock(). But at second glance, it looks like htcx->queue
> > might have the same vulnerability as rcu_conn in your earlier code.
>
> I am not the author of the code at the URL I specified. I provided the
> link answering the question to show other possible users of round-robin
> semantics for RCU list traversal. In my 'list_next_or_null_rr_rcu()'
> case I can't use a loop, I leak the pointer and indeed have to be very
> careful. But perhaps we can come up with some generic solution to cover
> both cases: -rr loop and -rr next.

Ah. Could you please check their update-side code to make sure that it
looks correct to you?

Thanx, Paul