Re: [PATCH, RFC, tip/core/rcu] scalable classic RCU implementation
From: Paul E. McKenney
Date: Tue Aug 26 2008 - 12:05:58 EST
On Mon, Aug 25, 2008 at 03:02:30PM -0700, Josh Triplett wrote:
> On Fri, 2008-08-22 at 18:53 -0700, Paul E. McKenney wrote:
> > On Fri, Aug 22, 2008 at 04:29:32PM -0700, Josh Triplett wrote:
> > > On Thu, 2008-08-21 at 16:43 -0700, Paul E. McKenney wrote:
> > > > - spinlock_t lock ____cacheline_internodealigned_in_smp;
> > > > - cpumask_t cpumask; /* CPUs that need to switch in order */
> > > > - /* for current batch to proceed. */
> > > > +/*
> > > > + * Definition for node within the RCU grace-period-detection hierarchy.
> > > > + */
> > > > +struct rcu_node {
> > > > + spinlock_t lock;
> > > > + unsigned long qsmask; /* CPUs or groups that need to switch in */
> > > > + /* order for current grace period to proceed.*/
> > > > + unsigned long qsmaskinit;
> > > > + /* Per-GP initialization for qsmask. */
> > > > + int grplo; /* lowest-numbered CPU or group here. */
> > > > + int grphi; /* highest-numbered CPU or group here. */
> > > > + char grpnum; /* CPU/group number for next level up. */
> > > > + char level; /* root is at level 0. */
> > >
> > > These four fields should use sized types, and preferably unsigned types.
> >
> > OK for grpnum and level, but grphi and grplo need to be "int" to
> > match the various CPU-manipulation primitives.
>
> Fair enough; the CPU-manipulation primitives do indeed use "int". Odd
> that they use a signed type.
It does allow use of -1 for "no particular CPU" or for error checking,
which can sometimes be useful.
> > > > + struct rcu_node *parent;
> > > > } ____cacheline_internodealigned_in_smp;
> > > >
> > > > -/* Is batch a before batch b ? */
> > > > -static inline int rcu_batch_before(long a, long b)
> > > > -{
> > > > - return (a - b) < 0;
> > > > -}
> > > > +/*
> > > > + * RCU global state, including node hierarchy. This hierarchy is
> > > > + * represented in "heap" form in a dense array. The root (first level)
> > > > + * of the hierarchy is in ->node[0] (referenced by ->level[0]), the second
> > > > + * level in ->node[1] through ->node[m] (->node[1] referenced by ->level[1]),
> > > > + * and the third level in ->node[m+1] and following (->node[m+1] referenced
> > > > + * by ->level[2]). The number of levels is determined by the number of
> > > > + * CPUs and by CONFIG_RCU_FANOUT. Small systems will have a "hierarchy"
> > > > + * consisting of a single rcu_node.
> > > > + */
> > > > +struct rcu_state {
> > > > + struct rcu_node node[NUM_RCU_NODES]; /* Hierarchy. */
> > > > + struct rcu_node *level[NUM_RCU_LEVELS]; /* Hierarchy levels. */
> > > > + int levelcnt[MAX_RCU_LEVELS + 1]; /* # nodes in each level. */
> > > > + int levelspread[NUM_RCU_LEVELS]; /* kids/node in each level. */
> > >
> > > These two should use sized types.
> >
> > Fair enough. And can be 8 bits, for that matter.
>
> levelspread can, since it will never exceed 64, but levelcnt cannot.
> That would lead to a bug on systems with more than 256 CPUs.
Good catch!!! Fixed.
> > > > +
> > > > + /* The following fields are guarded by the root rcu_node's lock. */
> > > > +
> > > > + char signaled ____cacheline_internodealigned_in_smp;
> > > > + /* sent GP-kick IPIs? */
> > >
> > > u8 or bool, depending on semantics. If just a simple flag, how about
> > > bool?
> >
> > This will need to be a non-bool shortly.
>
> OK.
>
> > OK, so what the heck -are- the official type names??? u8 seems
> > to be defined in a powerpc-specific file. OK, it also appears in
> > include/asm-generic/int-l64.h. s8, u8, s16, u16, s32, u32, s64, and
> > u64, then?
>
> Yes. {s,u}{8,16,32,64}, defined in include/asm-generic/int-{l,ll}64.h,
> depending on architecture.
Got it!
> > > > int cpu;
> > > > struct rcu_head barrier;
> > > > };
> > > >
> > > > +extern struct rcu_state rcu_state;
> > > > DECLARE_PER_CPU(struct rcu_data, rcu_data);
> > > > +
> > > > +extern struct rcu_state rcu_bh_state;
> > > > DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
> > >
> > > Why extern and in the header? I don't see anything else using them.
> >
> > kernel/rcuclassic_trace.c, right?
>
> Hmmm, true. Unfortunate, particularly if only for the benefit of
> tracing code which doesn't even get compiled under normal circumstances.
Indeed. Putting rcuclassic_trace.c into rcuclassic.c gets pretty ugly.
I suppose that another possibility would be to #include rcuclassic_trace.c
into rcuclassic.c, which might actually be the best approach.
> > > > select DEBUG_FS
> > > > default y
> > > > help
> > > > @@ -77,3 +76,33 @@ config RCU_TRACE
> > > >
> > > > Say Y here if you want to enable RCU tracing
> > > > Say N if you are unsure.
> > > > +
> > > > +config RCU_FANOUT
> > > > + int "Hierarchical RCU fanout value"
> > > > + range 2 64 if 64BIT
> > > > + range 2 32 if !64BIT
> > > > + depends on CLASSIC_RCU
> > > > + default 64 if 64BIT
> > > > + default 32 if !64BIT
> > > > + help
> > > > + This option controls the fanout of hierarchical implementations
> > > > + of RCU, allowing RCU to work efficiently on machines with
> > > > + large numbers of CPUs. This value must be at least the cube
> > > > + root of NR_CPUS, which allows NR_CPUS up to 32,768 for 32-bit
> > > > + systems and up to 262,144 for 64-bit systems.
> > > > +
> > > > + Select a specific number if testing RCU itself.
> > >
> > > ...or if attempting to tune for a specific NUMA system.
> >
> > Indeed. But I need to see an actual example before I document it.
> > It would be easy to make things slower by following the NUMA hardware
> > layout.
>
> Fair enough.
>
> > > > + Take the default if unsure.
> > > > +
> > > > +config RCU_FANOUT_EXACT
> > > > + bool "Disable hierarchical RCU auto-balancing"
> > > > + depends on CLASSIC_RCU
> > > > + default n
> > > > + help
> > > > + This option forces use of the exact RCU_FANOUT value specified,
> > > > + regardless of imbalances in the hierarchy. This can be useful
> > > > + on systems with strong NUMA behavior.
> > > > +
> > > > + Without RCU_FANOUT_EXACT, the code will balance the hierarchy.
> > >
> > > You might want to give a specific example of a NUMA machine, the
> > > appropriate value to use on that machine, and the result with and
> > > without RCU_FANOUT_EXACT.
> >
> > Or change "can" to "might". ;-)
>
> :)
>
> Right, my comment only applies if such an example actually exists. :)
Hopefully we have correctly tuned the uncertainty.
> > > > -static int blimit = 10;
> > > > -static int qhimark = 10000;
> > > > -static int qlowmark = 100;
> > > > +static int blimit = 10; /* Maximum callbacks per softirq. */
> > > > +static int qhimark = 10000; /* If this many pending, ignore blimit. */
> > > > +static int qlowmark = 100; /* Once only this many pending, use blimit. */
> > >
> > > Indentation mismatch on the comments?
> >
> > Looks fine in the source -- context diff-ism.
>
> Sigh. Yay for tabs.
>
> > > > #ifdef CONFIG_SMP
> > > > -static void force_quiescent_state(struct rcu_data *rdp,
> > > > - struct rcu_ctrlblk *rcp)
> > > > +static void force_quiescent_state(struct rcu_state *rsp)
> > > > {
> > > > int cpu;
> > > > - cpumask_t cpumask;
> > > > unsigned long flags;
> > > >
> > > > set_need_resched();
> > > > - spin_lock_irqsave(&rcp->lock, flags);
> > > > - if (unlikely(!rcp->signaled)) {
> > > > - rcp->signaled = 1;
> > > > + if (!spin_trylock_irqsave(&rsp->onofflock, flags))
> > > > + return;
> > >
> > > This seems to make force_quiescent_state rather less forceful.
> >
> > It will try again on the next scheduling-clock interrupt. The reason
> > I did this is because ->onofflock is a global lock acquired when
> > beginning a quiescent state or when onlining/offlining. Can't let
> > force_quiescent_state() monopolize things, and would like to exclude
> > online/offline while doing force_quiescent_state(). Hence make
> > force_quiescent_state() back off if the lock is held.
> >
> > There is probably a better way to do this...
>
> Primarily concerned about the possibility of perpetual failure. Then
> again, eventually a grace period will occur "naturally". Just wondering
> whether the inability to force might cause a problem.
Ah! So the lock can fail for the following reasons:
1. Some other CPU is in force_quiescent_state(). Here there is
clearly no problem.
2. Some other CPU is initializing the rcu_node hierarchy to set
up a new quiescent state. Here, we shouldn't have been
executing force_quiescent_state() in the first place, so
again no problem.
3. Some other CPU is adjusting the rcu_node hierarchy to account
for a CPU online or offline operation. There is enough overhead
in onlining and offlining CPUs that it seems unlikely that this
could result in a denial of service. However, if someone can
make this happen, I will make the online/offline operation check
to see if it should do a force_quiescent_state() -- which will
require an __force_quiescent_state() where the onofflock is
acquired by the caller.
So we are covered on #1 and #2, and very likely covered on #3, with an
easy fix if I am wrong.
> > > > -#else
> > > > +#else /* #ifdef CONFIG_HOTPLUG_CPU */
> > > >
> > > > -static void rcu_offline_cpu(int cpu)
> > > > +static inline void
> > > > +rcu_offline_cpu(int cpu)
> > > > {
> > > > }
> > >
> > > No need to explicitly say "inline"; GCC should do the right thing here.
> > > Same comment applies a couple of other places in your patch.
> >
> > OK, I will get rid of these. You can do the other 26,000 of them. ;-)
>
> :)
>
> > > > @@ -658,14 +806,19 @@ int rcu_needs_cpu(int cpu)
> > > > struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
> > > > struct rcu_data *rdp_bh = &per_cpu(rcu_bh_data, cpu);
> > > >
> > > > - return !!rdp->nxtlist || !!rdp_bh->nxtlist || rcu_pending(cpu);
> > > > + return !!*rdp->nxttail[RCU_DONE_TAIL] ||
> > > > + !!*rdp_bh->nxttail[RCU_DONE_TAIL] ||
> > > > + rcu_pending(cpu);
> > >
> > > !! seems unnecessary here.
> >
> > Someone once told me why this was necessary, but I forget. It was in the
> > original, and I didn't put it there. Some weirdness about conversion
> > to 32-bit integer when the lower 32 bits of the pointer was zero or
> > some such. So if your pointer value was 0x100000000, for example,
> > so that conversion to int gives zero.
>
> Good point! That doesn't apply if you use ||, though. If you just did
> "return somepointer" that could potentially cause the problem you
> describe. In any case, it can't *hurt* to have it; GCC should do the
> sane thing.
OK. I will review this towards the end, leaving it there to remind me
in the meantime.
So, would I need the !! on the left-hand operand of the first || due
to short-circuiting?
> > > > +void call_rcu_bh(struct rcu_head *head,
> > > > + void (*func)(struct rcu_head *rcu))
> > > > +{
> > > > + unsigned long flags;
> > > > +
> > > > + head->func = func;
> > > > + head->next = NULL;
> > > > + local_irq_save(flags);
> > > > + __call_rcu(head, &rcu_bh_state, &__get_cpu_var(rcu_bh_data));
> > > > + local_irq_restore(flags);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(call_rcu_bh);
> > >
> > > This comment applies to the original code, but:
> > > You only call __call_rcu twice, in call_rcu and call_rcu_bh. Both
> > > times, you set head first, then wrap the call with local_irq_save. How
> > > about moving both into __call_rcu, making call_rcu and call_rcu_bh
> > > one-liners?
> >
> > I can't pass "rcu_data" to a function (or at least I don't know how to
> > do so, short of passing __per_cpu_rcu_data and doing the per-CPU stuff
> > by hand). I could make __call_rcu() be a macro, but that seemed more
> > ugly than it seemed worthwhile.
> >
> > Is there some other approach that would work?
>
> Hmmm. No, not that I know of. Sigh.
The only other thing I can think of is dynamically allocated per-CPU
variables, which seemed more ugly than helpful in this case.
> > > > +static char *rcuclassic_trace_buf;
> > > > +#define RCUPREEMPT_TRACE_BUF_SIZE 4096
> > >
> > > Did you perhaps want PAGE_SIZE?
> >
> > I really want some way of gracefully handling arbitrarily long output
> > to debugfs. I am sure that some such exists, but haven't found it.
> > What I do instead is to arbitrarily truncate output to 4096 bytes,
> > which will be stunningly less than useful on a 4,096-CPU machine. :-/
> >
> > Suggestions welcome!
>
> I can see two possibilities, depending on how much complexity you want.
>
> The complicated way: do one pass calling snprintf everywhere and adding
> up the total length used, and if you run out of memory during that pass,
> reallocate the buffer to at least the total length you accumulated. Or
> something like that.
>
> The simple hack:
> #define RCUPREEMPT_TRACE_BUF_SIZE (NR_CPUS * something)
>
> :)
Given that this doesn't show up in production kernels, I will take
door #2. Though I was hoping for some sort of interface that "just
made it work" regardless of the size of user reads and the length
and pattern of in-kernel prints, but that might be a bit much...
Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/