Re: [PATCH, RFC] RCU: New scalable, multi-GP and preemptible RCU implementation

From: Lai Jiangshan
Date: Mon Sep 13 2010 - 08:25:44 EST


On 09/09/2010 03:53 AM, Paul E. McKenney wrote:
> On Tue, Aug 31, 2010 at 11:03:11AM +0800, Lai Jiangshan wrote:
>> On 08/31/2010 07:57 AM, Paul E. McKenney wrote:
>>> On Mon, Aug 30, 2010 at 05:31:44PM +0800, Lai Jiangshan wrote:
>>>>
>>>> 0) contents
>>>> 1) overview
>>>> 2) scalable
>>>> 3) multi-GP
>>>> 4) integrated expedited synchronize_rcu
>>>> 5) preemptible
>>>> 6) top-down implies
>>>> 7) speech
>>>> 8) core rcuring
>>>
>>> Interesting! I can't claim to have dug through every detail, but figured
>>> I should ask a few questions first.
>>>
>>> Thanx, Paul
>>>
>>> Of course, the first question is whether it passes rcutorture, and if
>>> so on what hardware configuration.
>>
>> Only on x86 (32-bit and 64-bit) systems.
>> Could you apply the patch on Power and test it there? It needs more
>> testing on different architectures, but I have no other hardware.
>>
>> What is "hardware configuration"?
>
> Mainly the number of CPUs. The socket/core/thread counts could also
> be helpful.
>
>>> At first glance, it looks to live somewhere between Manfred Spraul's
>>> counter-based hierarchical RCU and classic RCU. There are also some
>>> resemblances to K42's "generations" RCU-like mechanism.
>>>
>>>> 1) overview
>>>> This is a new RCU implementation: RCURING. It uses a ring atomic counters
>>>> to control the completion of GPs and a ring callbacks batches to process
>>>> the callbacks.
>>>
>>> The ring is a set of grace periods, correct?
>>
>> Correct. A GP is also controlled by a set of ring counters in the range
>> (domain's done_complete, this GP's complete]; the GP completes only
>> when all of these counters become zero.
>
> Understood.
>
>> Let me pull up one of your questions here:
>>> For a given grace period, all quiescent states hit the same counter.
>>> Or am I missing something?
>>
>> No.
>> Every rcu read site holds a locked_complete; as long as a cpu holds a
>> locked_complete, the counter of that locked_complete cannot become zero.
>>
>> A quiescent state releases the previous rcu read site's locked_complete,
>> so different QSes may hit different counters.
>> A QS always locks the newest GP (the one not yet being waited for, the
>> domain's curr_complete) for the next rcu read site, and thereby gets a
>> new locked_complete.
>>
>> What I said may still be unclear; more questions are welcome, and the
>> questions also help with the documentation. Thank you very much.
>
> OK, here is my assessment thus far. First my understanding of how this
> all works:
>
> o Maintain a circular array of counters. Maintain a parallel
> circular array of callback lists.
>
> o Each non-dyntick-idle CPU holds a reference to one element of the
> array. If Ring RCU is idle, it holds a reference to the element
> that will serve the next grace period. Similarly, if a given
> CPU has passed through a quiescent state for all existing grace
> periods, it will hold a reference to the element that will serve
> for the next grace period.
>
> o Upon encountering a quiescent state, we acquire a reference to
> the element that will serve for the next grace period, and
> then release the reference to the element that we were holding.
>
> The rcuring_lock() function acquires a reference to the element
> that will serve for the next grace period. Not yet clear what
> the atomic_inc_not_zero() is for. If the reference count is zero,
> what makes it non-zero?

rcuring_lock() may be called when the current CPU does not hold any
reference on the rcuring, so the rcuring may complete many GPs while
rcuring_lock() is acquiring its reference. But rcuring_lock() must not
acquire a reference on an already-completed GP, so I use
atomic_inc_not_zero(): if the reference count is zero, the element may
belong to a completed GP, and we must try again.

In rcuring_advance_lock() we have no such problem, since the CPU already
holds a reference on an older element; we could actually use
rcuring_dup_lock() there.

>
> The rcuring_dup_lock() function acquires another reference to
> the element that the CPU already holds a reference to. This
> is used in preemptible RCU when a task is preempted in an
> RCU read-side critical section.
>
> The rcuring_unlock() function releases a reference.
>
> The rcuring_advance_lock() acquires a new reference, then
> releases the old one. This is used when a CPU passes through
> a quiescent state.
>
> o When a CPU enters dyntick-idle mode, it releases its reference.
> When it exits dyntick-idle mode (including irq and NMI), it
> acquires a reference to the element that will serve for the next
> grace period.

Yes, the rcuring totally ignores a cpu that does not hold any reference.

>
> o The __rcu_check_callbacks() function checks to see if the
> oldest grace period is now reference-free, and, if so,
> rcuring_advance_done_complete() retires the old grace periods.
> The advance_callbacks() then moves any affected callbacks to
> be invoked.
>
> o If RCU callbacks are not invoked quickly enough (10 jiffies),
> force_quiescent_state() sends IPIs to CPUs needing help
> reaching a quiescent state.
>
>
> So there were some things that struck me as being really good with
> your approach, most of which I intend to pull into the current RCU
> implementations:

I will do this as well.

>
> o Offsets into task_struct, allowing inlining of rcu_read_lock()
> and the fastpath of rcu_read_unlock().
>
> o Faster grace periods, at least when the system is busy.
>
> o Placement of rcu_copy_process() and exit_rcu() into
> include/linux/ringrcu.h. Not clear that this maps nicely
> to tree/tiny implementations, though.
>
> o Defining __rcu_barrier() in terms of __call_rcu() instead of the
> various call_rcu*() variants. Ditto for __synchronize_rcu().
>
>
> There are some things that I am unsure of, perhaps because I am still
> misunderstanding something:
>
> o The per-CPU reference-counting scheme can gain much of the
> benefit that TREE_RCU gains by handling large numbers of updates
> with a single grace period. However, CPUs that pass through
> quiescent states quickly, for example, by frequently interrupting
> out of dyntick-idle state, can incur lots of cache misses cycling
> through lots of RING RCU array elements.

It is true. But I think frequent interrupts seldom happen while in the
dyntick-idle state, and this can be fixed easily.

>
> o Aggressive code shrinkage through large cpp macros. You seem to
> share my ambivalence given that you use it in only some of the
> situations where it might be applied. ;-)
>
> The savings of lines of code does vary -- many definitions can
> be placed under a single #ifdef, after all.

I tried my best to abstract RCU and to reuse the common code;
I think we can use it for rcutree too.

>
> o Use of kref rather than a simple counter. Saves a line or two,
> adds a function.

I wanted to save on comments: a kref is always initialized to 1,
so the code is its own comment.

>
> o __rcu_check_callbacks() won't force a grace period if there
> are no callbacks waiting to be invoked. On the one hand, one
> can argue that forcing a grace period is useless in that case,
> but on the other hand, there is usually more than one CPU.
> This is probably a consequence of the force_quiescent_state()
> time being based on callback residence time rather than on
> grace-period duration -- only CPUs with callbacks can possibly
> measure a meaningful residence time.

I think it's useless to FQS when there are no callbacks on the current
cpu. FQS is not scalable because it requires a global lock.

>
> o NMI handling? Note that raw_local_irq_save() does not
> block NMIs, so need to analyze recursion into rcu_kernel_enter()
> and rcu_kernel_exit(). Looks like it might be OK, but...
> Ironically enough, courtesy of the atomic instructions and
> cache misses that are so scary in these functions. ;-)

It's NMI-safe; I added barrier()s:

+static void __rcu_kernel_enter_outmost(int cpu)
+{
+ GEN_CODES(LOCK_COMPLETE)
+
+ /* The barrier()s order the kernel_count update against the
+ * lock/save steps, so an NMI arriving in between sees a
+ * consistent state. */
+ barrier();
+ per_cpu(kernel_count, cpu) = 1;
+ barrier();
+
+ GEN_CODES(SAVE_COMPLETE)
+}


>
> o Eliminating rcu_pending() might be a good thing, though perhaps
> simplifying it (making it less exact) might be the right thing
> to do in TREE_RCU, particularly given that priority boosting
> turns RCU_SOFTIRQ into a kthread.
>
>
> Here are some things that could be problematic about Ring RCU. Some
> of them are fixable, others might or might not be:
>
> o Potentially high memory contention on a given ring: worst case
> is quite bad on large SMP systems. On the other hand, best
> case is really good, given that each CPU could get its own
> rcuring counter. Average case is likely to be modestly bad.
> The default value of 4 for RCURING_COUNTERS_SHIFT maps to 16
> array elements, which could easily show worst case behavior from
> time to time on large systems.

It may be true; there is only one global rcuring per rcu_domain.
But since a QS takes just 6 atomic operations without any locks, will
that reduce the pain of memory contention?

It would be hard to fix if this becomes a problem, but I believe 6
atomic operations per QS is very economical.

(Maybe the only fix would be to use a hierarchy for RCURING.)

>
> o Real-time latency could be problematic due to:
>
> o Scans across CPUs and across grace-period ring elements.

Scans across CPUs: these happen only in FQS, but RCURING does an FQS
very, very seldom, and we can fix it anyway, e.g. by calling FQS from a
kthread.

Scans across grace-period ring elements: we use a small RCURING_IDX_MAX.

>
> o Worst-case memory contention.
>
> o High interrupt rates out of dyntick-idle state.
>
> o Potentially large memory consumption due to RCURING_IDX_MAX
> arrays in rcu_data and rcuring structures. The default of
> 16 elements should be OK.
>
> o Use of atomic instructions in dyntick path. (Not just one, but
> potentially a long string of them -- and implied cache misses.
> Understand that the expected case is a single atomic in
> rcuring_lock() -- but worst case is at least somewhat important.)
>
> Benefit is that you don't need to scan the CPUs to check for
> dyntick-idle CPUs -- force_quiescent_state() is then limited to
> resched IPIs. Unless of course the dyntick-idle CPUs remain in
> dyntick-idle state throughout, in which case they get IPIed. This
> situation limits dyntick-idle state to 10 jiffies, which would
> result in the energy-efficiency guys screaming bloody murder.
> The battery-powered embedded guys would not bother screaming,
> but could instead be expected to come after us with pitchforks.

I hate calling force_quiescent_state(); it is not scalable in either
RCURING or RCUTREE.

In RCURING/RCUTREE, a GP takes about 3 jiffies on average, so I use a
10-jiffy FQS interval; thus force_quiescent_state() is called rarely.

In force_quiescent_state() I forgot to test the destination cpu's
kernel_count; that will be fixed. RCURING is energy-efficient.

>
> o Energy efficiency: see above paragraph.

Good catch.

>
>
> The remainder are shortcomings that should be easy to fix:
>
> o force_quiescent_state() for preemptible variant not implemented.
> (Waiting on RCU priority boosting.)
>
> o Constants in __rcu_check_callbacks() really need to be
> macroized. "10"? "2 * HZ"?
>
> o rcu_do_batch() needs to self-limit in multi-CPU configurations.
> Otherwise, latency is a problem. TINY_RCU gets away without
> this (for the time being, anyway), but TREE_RCU and CLASSIC_RCU
> before it need the blimit functionality. Otherwise the
> Linux Audio guys will scream.
>
> o Too much stuff directly called from scheduling-clock-interrupt
> context! Full scan of CPUs in __force_quiescent_state(), for
> example. Need to hand off to RCU_SOFTIRQ or a kthread.
>
> o print_rcuring_stall() does not dump stacks.
>
> So, what am I still missing about your approach?
>

Thank you very much. I think the potentially high memory contention
could be a hard problem, but do you plan to merge it?

Lai.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/