Re: RCU qsmask !=0 warnings on large-SMP...

From: Steffen Persvold
Date: Wed Jan 25 2012 - 15:35:20 EST


On 1/25/2012 19:14, Paul E. McKenney wrote:
[]
CONFIG_NO_HZ is not set, so it should not happen. We see the same behavior with CONFIG_NO_HZ=y, though it usually takes longer to reproduce.

OK, CONFIG_NO_HZ=n has the least code involved, so it would be best
for debugging.

Good, that was my thought as well when looking at the code. I'm reducing NR_CPUS to 512 now to get a two-level tree, just for simpler debugging (the issue is still present).

[]
Because the RCU tree is 3 levels deep, the printout function we added in the patch gets called 3 times, each time with the same RDP but with a different RNP (in rcu_start_gp()).

Ah, good point. Hmmm...

Looking back at Daniel's original email, we have the root rcu_node
structure with ->qsmask=0x1 (indicating the first descendant), the next
level with ->qsmask=0x8 (indicating the fourth descendant), and the last
level with ->qsmask=0x1, again indicating the first descendant. So,
0, 16, 32, 48, which agrees with the CPU number below that has not
yet caught up to the current grace period.
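
A quick standalone sketch of that decoding (the per-level CPU spans are
assumptions based on the numbers above; the root-level span does not
affect this particular result because the root's set bit is bit 0):

	/* Illustration only, not kernel code: decode the stray qsmask bits. */
	#include <stdio.h>

	int main(void)
	{
		unsigned long qsmask[] = { 0x1, 0x8, 0x1 };	/* root, middle, leaf */
		int span[] = { 256, 16, 1 };	/* assumed CPUs per child at each level */
		int cpu = 0;

		for (int level = 0; level < 3; level++)
			cpu += __builtin_ctzl(qsmask[level]) * span[level];

		printf("stray qsmask bits decode to CPU %d\n", cpu);	/* 0 + 3*16 + 0 = 48 */
		return 0;
	}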

Another really odd thing... If we are starting the new grace period,
we should have incremented rsp->gpnum. And in fact, we do have
rsp->gpnum being one larger than rsp->completed, as expected. But
if we have only initialized the root rcu_node structure, how can
the per-CPU rcu_data structures know about the new grace period yet?

There was a time when the CPUs could find out early, but I think that
was a long time ago. Yes, check_for_new_grace_period() does compare
rdp->gpnum against rsp->gpnum, but it calls note_new_gpnum(), which
acquires rnp->lock, and nothing happens unless __note_new_gpnum()
sees that rnp->gpnum differs from rdp->gpnum.
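
In other words (paraphrasing the 3.2-era code in kernel/rcutree.c rather
than quoting it exactly), the per-CPU view only advances once the leaf
rcu_node has seen the new gpnum:

	/* Paraphrase of the gating in __note_new_gpnum(); trimmed, not exact. */
	if (rdp->gpnum != rnp->gpnum) {
		/* Our leaf has seen the new GP: record it and reset the
		 * per-CPU quiescent-state bookkeeping. */
		rdp->gpnum = rnp->gpnum;
	}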

So, it would be very interesting to add the values rdp->mynode->gpnum
and rdp->mynode->completed to your list, perhaps labeling them something
like "rng" and "rnc" respectively.

I will add this to the printout.
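
Something like this, I guess (sketch only -- the rcu_debug_print() helper
from our debug patch is not shown in this thread, so the other fields are
just what it presumably already dumps):

	struct rcu_data *rdp = this_cpu_ptr(rsp->rda);

	pr_info("%s: gp=%lu comp=%lu qsmask=%lx rdp-gp=%lu rng=%lu rnc=%lu\n",
		rsp->name, rsp->gpnum, rsp->completed, rnp->qsmask,
		rdp->gpnum, rdp->mynode->gpnum, rdp->mynode->completed);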


Of course, CPU 48 should not have decided that it was done with the
old grace period before clearing its bit. For that matter, some
CPU somewhere decided that the grace period was done despite the
root rcu_node's ->qsmask being non-zero, which should be prevented
by the:

if (rnp->qsmask != 0 || rcu_preempt_blocked_readers_cgp(rnp)) {

line in rcu_report_qs_rnp().
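
For context, that check sits in the loop that walks a quiescent-state
report up the tree (paraphrased and trimmed from 3.2-era kernel/rcutree.c):

	for (;;) {
		/* ... clear our bit at this level ... */
		rnp->qsmask &= ~mask;
		if (rnp->qsmask != 0 || rcu_preempt_blocked_readers_cgp(rnp)) {
			/* Other CPUs (or blocked readers) still owe this
			 * rcu_node a quiescent state, so the grace period
			 * cannot end yet. */
			raw_spin_unlock_irqrestore(&rnp->lock, flags);
			return;
		}
		/* ... break out at the root, otherwise move up to the parent ... */
	}
	/* Only when the root's ->qsmask clears do we end the grace period. */
	rcu_report_qs_rsp(rsp, flags);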

3.2 has some event tracing that would be extremely helpful in tracking
this down. Are you able to run 3.2?

Yes, 3.2.1 is our debug target right now.

[]
Same here, but most of the ql= values are larger. Later printout?

The loop in rcu_start_gp() releases the node lock between each rcu_node it visits in the tree (it has to):

rcu_for_each_node_breadth_first(rsp, rnp) {
	raw_spin_lock(&rnp->lock);	/* irqs already disabled. */
	rcu_debug_print(rsp, rnp);
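	/* [continuation from memory of the 3.2 loop body, added here for
	 *  clarity: the per-node setup ends by dropping the lock before
	 *  the next node is visited] */
	raw_spin_unlock(&rnp->lock);	/* irqs remain disabled. */
}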

so I guess this could allow the ql= values to increase, no?

The ql= values can increase anyway -- those queues are only accessed by
the corresponding CPU or from stop_machine context. The small increases
are entirely consistent with your having bits set at all levels of the
rcu_node tree. The reason I was surprised is that my earlier bugs (as
in before the code hit mainline) only ever resulted in a single level
having a stray bit.

Ok.

[]

Thanks for looking into this Paul, we'd be more than happy to test out theories and patches.

The event tracing, particularly the "rcu_grace_period" set, would be
very helpful.

Are you talking about the data from /sys/kernel/debug/rcu/? I have CONFIG_RCU_TRACE (and consequently CONFIG_TREE_RCU_TRACE) set; is this enough to get the event data you want?

Cheers,
--
Steffen Persvold, Chief Architect NumaChip
Numascale AS - www.numascale.com
Tel: +47 92 49 25 54 Skype: spersvold