Re: [PATCH v5 04/45] percpu_rwlock: Implement the core design ofPer-CPU Reader-Writer Locks

From: Srivatsa S. Bhat
Date: Sun Feb 10 2013 - 14:13:21 EST

Next message: Srivatsa S. Bhat: "Re: [PATCH v5 04/45] percpu_rwlock: Implement the core design ofPer-CPU Reader-Writer Locks"
Previous message: Pavel Machek: "Re: PAE problems was [RFC] Reproducible OOM with just a few sleeps"
In reply to: Paul E. McKenney: "Re: [PATCH v5 04/45] percpu_rwlock: Implement the core design ofPer-CPU Reader-Writer Locks"
Next in thread: Paul E. McKenney: "Re: [PATCH v5 04/45] percpu_rwlock: Implement the core design ofPer-CPU Reader-Writer Locks"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 02/09/2013 04:40 AM, Paul E. McKenney wrote:
> On Tue, Jan 22, 2013 at 01:03:53PM +0530, Srivatsa S. Bhat wrote:
>> Using global rwlocks as the backend for per-CPU rwlocks helps us avoid many
>> lock-ordering related problems (unlike per-cpu locks). However, global
>> rwlocks lead to unnecessary cache-line bouncing even when there are no
>> writers present, which can slow down the system needlessly.
>>
[...]
>
> Looks pretty close! Some comments interspersed below. Please either
> fix the code or my confusion, as the case may be. ;-)
>

Sure :-)

>> ---
>>
>> include/linux/percpu-rwlock.h | 10 +++
>> lib/percpu-rwlock.c | 128 ++++++++++++++++++++++++++++++++++++++++-
>> 2 files changed, 136 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/percpu-rwlock.h b/include/linux/percpu-rwlock.h
>> index 8dec8fe..6819bb8 100644
>> --- a/include/linux/percpu-rwlock.h
>> +++ b/include/linux/percpu-rwlock.h
>> @@ -68,4 +68,14 @@ extern void percpu_free_rwlock(struct percpu_rwlock *);
>> __percpu_init_rwlock(pcpu_rwlock, #pcpu_rwlock, &rwlock_key); \
>> })
>>
>> +#define reader_uses_percpu_refcnt(pcpu_rwlock, cpu) \
>> + (ACCESS_ONCE(per_cpu(*((pcpu_rwlock)->reader_refcnt), cpu)))
>> +
>> +#define reader_nested_percpu(pcpu_rwlock) \
>> + (__this_cpu_read(*((pcpu_rwlock)->reader_refcnt)) > 1)
>> +
>> +#define writer_active(pcpu_rwlock) \
>> + (__this_cpu_read(*((pcpu_rwlock)->writer_signal)))
>> +
>> #endif
>> +
>> diff --git a/lib/percpu-rwlock.c b/lib/percpu-rwlock.c
>> index 80dad93..992da5c 100644
>> --- a/lib/percpu-rwlock.c
>> +++ b/lib/percpu-rwlock.c
>> @@ -64,21 +64,145 @@ void percpu_free_rwlock(struct percpu_rwlock *pcpu_rwlock)
>>
>> void percpu_read_lock(struct percpu_rwlock *pcpu_rwlock)
>> {
>> - read_lock(&pcpu_rwlock->global_rwlock);
>> + preempt_disable();
>> +
>> + /* First and foremost, let the writer know that a reader is active */
>> + this_cpu_inc(*pcpu_rwlock->reader_refcnt);
>> +
>> + /*
>> + * If we are already using per-cpu refcounts, it is not safe to switch
>> + * the synchronization scheme. So continue using the refcounts.
>> + */
>> + if (reader_nested_percpu(pcpu_rwlock)) {
>> + goto out;
>> + } else {
>> + /*
>> + * The write to 'reader_refcnt' must be visible before we
>> + * read 'writer_signal'.
>> + */
>> + smp_mb(); /* Paired with smp_rmb() in sync_reader() */
>> +
>> + if (likely(!writer_active(pcpu_rwlock))) {
>> + goto out;
>> + } else {
>> + /* Writer is active, so switch to global rwlock. */
>> + read_lock(&pcpu_rwlock->global_rwlock);
>> +
>> + /*
>> + * We might have raced with a writer going inactive
>> + * before we took the read-lock. So re-evaluate whether
>> + * we still need to hold the rwlock or if we can switch
>> + * back to per-cpu refcounts. (This also helps avoid
>> + * heterogeneous nesting of readers).
>> + */
>> + if (writer_active(pcpu_rwlock))
>
> The above writer_active() can be reordered with the following this_cpu_dec(),
> strange though it might seem. But this is OK because holding the rwlock
> is conservative. But might be worth a comment.
>

Ok..

>> + this_cpu_dec(*pcpu_rwlock->reader_refcnt);
>> + else
>
> In contrast, no reordering can happen here because read_unlock() is
> required to keep the critical section underneath the lock.
>

Ok..

>> + read_unlock(&pcpu_rwlock->global_rwlock);
>> + }
>> + }
>> +
>> +out:
>> + /* Prevent reordering of any subsequent reads */
>> + smp_rmb();
>
> This should be smp_mb(). "Readers" really can do writes. Hence the
> name lglock -- "local/global" rather than "reader/writer".
>

Ok!

[ We were trying to avoid full memory barriers in get/put_online_cpus_atomic()
in the fastpath, as far as possible... Now it feels like all that discussion
was for nothing :-( ]

>> }
>>
>> void percpu_read_unlock(struct percpu_rwlock *pcpu_rwlock)
>> {
>> - read_unlock(&pcpu_rwlock->global_rwlock);
>
> We need an smp_mb() here to keep the critical section ordered before the
> this_cpu_dec() below. Otherwise, if a writer shows up just after we
> exit the fastpath, that writer is not guaranteed to see the effects of
> our critical section. Equivalently, the prior read-side critical section
> just might see some of the writer's updates, which could be a bit of
> a surprise to the reader.
>

Ok, will add it.

>> + /*
>> + * We never allow heterogeneous nesting of readers. So it is trivial
>> + * to find out the kind of reader we are, and undo the operation
>> + * done by our corresponding percpu_read_lock().
>> + */
>> + if (__this_cpu_read(*pcpu_rwlock->reader_refcnt)) {
>> + this_cpu_dec(*pcpu_rwlock->reader_refcnt);
>> + smp_wmb(); /* Paired with smp_rmb() in sync_reader() */
>
> Given an smp_mb() above, I don't understand the need for this smp_wmb().
> Isn't the idea that if the writer sees ->reader_refcnt decremented to
> zero, it also needs to see the effects of the corresponding reader's
> critical section?
>

Not sure what you meant, but my idea here was that the writer should see
the reader_refcnt falling to zero as soon as possible, to avoid keeping the
writer waiting in a tight loop for longer than necessary.
I might have been a little over-zealous to use lighter memory barriers though,
(given our lengthy discussions in the previous versions to reduce the memory
barrier overheads), so the smp_wmb() used above might be wrong.

So, are you saying that the smp_mb() you indicated above would be enough
to make the writer observe the 1->0 transition of reader_refcnt immediately?

> Or am I missing something subtle here? In any case, if this smp_wmb()
> really is needed, there should be some subsequent write that the writer
> might observe. From what I can see, there is no subsequent write from
> this reader that the writer cares about.
>

I thought the smp_wmb() here and the smp_rmb() at the writer would ensure
immediate reflection of the reader state at the writer side... Please correct
me if my understanding is incorrect.

>> + } else {
>> + read_unlock(&pcpu_rwlock->global_rwlock);
>> + }
>> +
>> + preempt_enable();
>> +}
>> +
>> +static inline void raise_writer_signal(struct percpu_rwlock *pcpu_rwlock,
>> + unsigned int cpu)
>> +{
>> + per_cpu(*pcpu_rwlock->writer_signal, cpu) = true;
>> +}
>> +
>> +static inline void drop_writer_signal(struct percpu_rwlock *pcpu_rwlock,
>> + unsigned int cpu)
>> +{
>> + per_cpu(*pcpu_rwlock->writer_signal, cpu) = false;
>> +}
>> +
>> +static void announce_writer_active(struct percpu_rwlock *pcpu_rwlock)
>> +{
>> + unsigned int cpu;
>> +
>> + for_each_online_cpu(cpu)
>> + raise_writer_signal(pcpu_rwlock, cpu);
>> +
>> + smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
>> +}
>> +
>> +static void announce_writer_inactive(struct percpu_rwlock *pcpu_rwlock)
>> +{
>> + unsigned int cpu;
>> +
>> + drop_writer_signal(pcpu_rwlock, smp_processor_id());
>
> Why do we drop ourselves twice? More to the point, why is it important to
> drop ourselves first?
>

I don't see where we are dropping ourselves twice. Note that we are no longer
in the cpu_online_mask, so the 'for' loop below won't include us. So we need
to manually drop ourselves. It doesn't matter whether we drop ourselves first
or later.

>> +
>> + for_each_online_cpu(cpu)
>> + drop_writer_signal(pcpu_rwlock, cpu);
>> +
>> + smp_mb(); /* Paired with smp_rmb() in percpu_read_[un]lock() */
>> +}
>> +
>> +/*
>> + * Wait for the reader to see the writer's signal and switch from percpu
>> + * refcounts to global rwlock.
>> + *
>> + * If the reader is still using percpu refcounts, wait for him to switch.
>> + * Else, we can safely go ahead, because either the reader has already
>> + * switched over, or the next reader that comes along on that CPU will
>> + * notice the writer's signal and will switch over to the rwlock.
>> + */
>> +static inline void sync_reader(struct percpu_rwlock *pcpu_rwlock,
>> + unsigned int cpu)
>> +{
>> + smp_rmb(); /* Paired with smp_[w]mb() in percpu_read_[un]lock() */
>
> As I understand it, the purpose of this memory barrier is to ensure
> that the stores in drop_writer_signal() happen before the reads from
> ->reader_refcnt in reader_uses_percpu_refcnt(),

No, that was not what I intended. announce_writer_inactive() already does
a full smp_mb() after calling drop_writer_signal().

I put the smp_rmb() here and the smp_wmb() at the reader side (after updates
to the ->reader_refcnt) to reflect the state change of ->reader_refcnt
immediately at the writer, so that the writer doesn't have to keep spinning
unnecessarily still referring to the old (non-zero) value of ->reader_refcnt.
Or perhaps I am confused about how to use memory barriers properly.. :-(

> thus preventing the
> race between a new reader attempting to use the fastpath and this writer
> acquiring the lock. Unless I am confused, this must be smp_mb() rather
> than smp_rmb().
>
> Also, why not just have a single smp_mb() at the beginning of
> sync_all_readers() instead of executing one barrier per CPU?

Well, since my intention was to help the writer see the update (->reader_refcnt
dropping to zero) ASAP, I kept the multiple smp_rmb()s.

>
>> +
>> + while (reader_uses_percpu_refcnt(pcpu_rwlock, cpu))
>> + cpu_relax();
>> +}
>> +
>> +static void sync_all_readers(struct percpu_rwlock *pcpu_rwlock)
>> +{
>> + unsigned int cpu;
>> +
>> + for_each_online_cpu(cpu)
>> + sync_reader(pcpu_rwlock, cpu);
>> }
>>
>> void percpu_write_lock(struct percpu_rwlock *pcpu_rwlock)
>> {
>> + /*
>> + * Tell all readers that a writer is becoming active, so that they
>> + * start switching over to the global rwlock.
>> + */
>> + announce_writer_active(pcpu_rwlock);
>> + sync_all_readers(pcpu_rwlock);
>> write_lock(&pcpu_rwlock->global_rwlock);
>> }
>>
>> void percpu_write_unlock(struct percpu_rwlock *pcpu_rwlock)
>> {
>> + /*
>> + * Inform all readers that we are done, so that they can switch back
>> + * to their per-cpu refcounts. (We don't need to wait for them to
>> + * see it).
>> + */
>> + announce_writer_inactive(pcpu_rwlock);
>> write_unlock(&pcpu_rwlock->global_rwlock);
>> }
>>
>>

Thanks a lot for your detailed review and comments! :-)

Regards,
Srivatsa S. Bhat

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Srivatsa S. Bhat: "Re: [PATCH v5 04/45] percpu_rwlock: Implement the core design ofPer-CPU Reader-Writer Locks"
Previous message: Pavel Machek: "Re: PAE problems was [RFC] Reproducible OOM with just a few sleeps"
In reply to: Paul E. McKenney: "Re: [PATCH v5 04/45] percpu_rwlock: Implement the core design ofPer-CPU Reader-Writer Locks"
Next in thread: Paul E. McKenney: "Re: [PATCH v5 04/45] percpu_rwlock: Implement the core design ofPer-CPU Reader-Writer Locks"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]