Re: RCU scaling on large systems

From: Jack Steiner
Date: Mon May 03 2004 - 13:43:52 EST


Paul -

Thanks for the reply.


Additional data from experiments today.

As expected, there are multiple hot spots related to the rcu_ctrlblk.

- scheduler_tick(), in the rcu_pending() macro - specifically the
  load of rcu_cpu_mask.

- rcu_check_quiescent_state() on spin_lock(&rcu_ctrlblk.mutex);

These two spots are ~equally hot.

Some of the cache line contention could be alleviated by separating
these fields into multiple cache lines. Wli posted a patch over the
weekend that does that. I have not had a chance to review the patch in
detail, but it looks like a reasonable idea.
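
For what it's worth, the general shape of that kind of split (just a
sketch of the idea, not wli's actual patch - I have not compared it
against his) is to move the fields that every cpu polls from
rcu_pending() on every tick away from the global lock that only the
quiescent-state path takes. It does not stop the rcu_cpu_mask line
itself from bouncing as cpus clear their bits, but it does keep
spin_lock traffic on the mutex from invalidating the line that
rcu_pending() loads:

struct rcu_ctrlblk {
        /* polled by every cpu from scheduler_tick()/rcu_pending() */
        long            curbatch;       /* current batch number */
        long            maxbatch;       /* max requested batch number */
        cpumask_t       rcu_cpu_mask;   /* cpus still owing a quiescent state */

        /* taken only on the quiescent-state slow path */
        spinlock_t      mutex ____cacheline_aligned_in_smp;
} ____cacheline_aligned_in_smp;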



-----------------
Response to your mail:



> From your numbers below, I would guess that if you have at least
> 8 CPUs per NUMA node, a two-level tree would suffice. If you have
> only 4 CPUs per NUMA node, you might well need a per-node level,
> a per-4-nodes level, and a global level to get the global lock
> contention reduced sufficiently.

The system consists of 256 nodes. Each node has 2 cpus located on
a shared FSB. The nodes are packaged as 128 modules - 2 nodes per
module. The 2 nodes in a module are slightly "closer" latency-wise
than nodes in different modules.
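
To make the mapping concrete, one way the hierarchy could look on this
topology is sketched below: a leaf per node (the 2 cpus sharing an FSB),
a level per module (2 nodes), and a global level. All of the names and
the rcu_start_next_batch() hook are made up for illustration - this is
not existing code, just the shape of the idea. A cpu only escalates to
the next level when it clears the last bit at its own level, so at most
128 cpus per grace period ever touch the global lock, and only one
report per module reaches it.

#include <linux/cache.h>
#include <linux/spinlock.h>
#include <linux/bitmap.h>

struct rcu_leaf {                       /* one per node: the 2 cpus on one FSB */
        spinlock_t      lock;
        unsigned long   cpu_mask;       /* cpus still owing a quiescent state */
} ____cacheline_aligned_in_smp;

struct rcu_module_level {               /* one per module: 2 nodes */
        spinlock_t      lock;
        unsigned long   leaf_mask;      /* leaves that have not yet reported */
} ____cacheline_aligned_in_smp;

static struct rcu_leaf          rcu_leaves[256];
static struct rcu_module_level  rcu_modules[128];

static struct {                         /* one per system */
        spinlock_t      lock;
        DECLARE_BITMAP(module_mask, 128);
} rcu_global;

static void rcu_start_next_batch(void); /* hypothetical hook, not shown */

static void rcu_cpu_quiet(int cpu)
{
        struct rcu_leaf *leaf = &rcu_leaves[cpu / 2];
        struct rcu_module_level *mod = &rcu_modules[cpu / 4];
        int last;

        spin_lock(&leaf->lock);
        leaf->cpu_mask &= ~(1UL << (cpu & 1));
        last = !leaf->cpu_mask;
        spin_unlock(&leaf->lock);
        if (!last)
                return;                 /* the sibling cpu still owes a QS */

        spin_lock(&mod->lock);
        mod->leaf_mask &= ~(1UL << ((cpu >> 1) & 1));
        last = !mod->leaf_mask;
        spin_unlock(&mod->lock);
        if (!last)
                return;                 /* the other node in the module still owes */

        spin_lock(&rcu_global.lock);    /* reached by at most 128 cpus per grace period */
        clear_bit(cpu / 4, rcu_global.module_mask);
        if (bitmap_empty(rcu_global.module_mask, 128))
                rcu_start_next_batch(); /* grace period complete */
        spin_unlock(&rcu_global.lock);
}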


>
> > I also found an interesting anomaly that was traced to RCU. I have
> > a program that measures "cpu efficiency". Basically, the program
> > creates a cpu bound thread for each cpu & measures the percentage
> > of time that each cpu ACTUALLY spends executing user code.
> > On an idle system, each cpu *should* spend >99% of its time in user mode.
> >
> > On a 512p idle 2.6.5 system, each cpu spends ~6% of the time in the kernel
> > RCU code. The time is spent contending for shared cache lines.
>
> Again, no surprise, Linux's RCU was not designed for a 512-CPU
> system. ;-)
>
> The hierarchical grace-period-detection scheme described above
> also increases cache locality, greatly reducing the cache-thrashing
> you are seeing.
>
> > Even more bizarre: if I repeatedly type "ls" in a *single* window
> > (probably 5 times/sec), then EVERY CPU IN THE SYSTEM spends ~50%
> > of the time in the RCU code.
>
> Hmmm... How large was the directory you were running "ls" on?
> At first blush, it sounds like the "ls" was somehow provoking
> a dcache update, which would then exercise RCU.

The directory size does not seem to be significant. I tried one test
on an NFS directory with 250 files and another on /tmp with 25 files.
In both cases, the results were similar.
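
(A side note on the "cpu efficiency" test mentioned above: the real
program is more involved, but the measurement boils down to roughly the
sketch below - one pinned spinner per cpu, then per-cpu user/system
deltas out of /proc/stat. The 10 second interval, the MAXCPU limit, and
the /proc/stat parsing details are just illustrative.)

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAXCPU  1024

static volatile int stop;

static void *spin(void *arg)
{
        long cpu = (long)arg;
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        sched_setaffinity(0, sizeof(mask), &mask);      /* pin to one cpu */
        while (!stop)
                ;                                       /* cpu bound, all user mode */
        return NULL;
}

/* per-cpu user+nice and system jiffies from the "cpuN ..." lines */
static void snapshot(int ncpus, long long *usr, long long *sys)
{
        char line[256];
        long long u, n, s, i;
        int cpu;
        FILE *f = fopen("/proc/stat", "r");

        while (f && fgets(line, sizeof(line), f))
                if (!strncmp(line, "cpu", 3) &&
                    line[3] >= '0' && line[3] <= '9' &&
                    sscanf(line + 3, "%d %lld %lld %lld %lld",
                           &cpu, &u, &n, &s, &i) == 5 && cpu < ncpus) {
                        usr[cpu] = u + n;
                        sys[cpu] = s;
                }
        if (f)
                fclose(f);
}

int main(void)
{
        static long long u0[MAXCPU], s0[MAXCPU], u1[MAXCPU], s1[MAXCPU];
        int ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        pthread_t tid;
        long cpu;

        if (ncpus > MAXCPU)
                ncpus = MAXCPU;
        for (cpu = 0; cpu < ncpus; cpu++)
                pthread_create(&tid, NULL, spin, (void *)cpu);

        snapshot(ncpus, u0, s0);
        sleep(10);                      /* measurement interval */
        snapshot(ncpus, u1, s1);
        stop = 1;

        /* idle is ~0 while every cpu runs a pinned spinner, so the
         * user fraction is just user/(user+system) over the interval */
        for (cpu = 0; cpu < ncpus; cpu++) {
                long long du = u1[cpu] - u0[cpu];
                long long ds = s1[cpu] - s0[cpu];

                printf("cpu%ld: %5.2f%% user\n", cpu,
                       du + ds ? 100.0 * du / (du + ds) : 0.0);
        }
        return 0;
}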


>
> > The RCU algorithms don't scale - at least on our systems!!!!!
>
> As noted earlier, the current implementation is not designed for
> 512 CPUs. And, as noted earlier, there are ways to make it
> scale. But for some reason, we felt it advisable to start with
> a smaller, simpler, and hence less scalable implementation. ;-)

Makes sense. I would not have expected otherwise. Overall, Linux scales
to 512p much better than I would have predicted.

Is anyone working on improving RCU scaling to higher cpu counts? I
don't want to duplicate any work that is already in progress. If
nothing is underway, I'll start investigating what can be done to
improve scaling.




>
> > Attached is an experimental hack that fixes the problem. I
> > don't believe that this is the correct fix but it does prove
> > that slowing down the frequency of updates fixes the problem.
> >
> >
> > With this hack, "ls" no longer measurably disturbs other cpus. Each
> > cpu spends ~99.8% of its time in user code regardless of the frequency
> > of typing "ls".
> >
> >
> >
> > By default, the RCU code attempts to process callbacks on each cpu
> > every tick. The hack adds a mask so that only a few cpus process
> > callbacks each tick.
>
> Cute! However, it is not clear to me that this approach is
> compatible with real-time use of RCU, since it results in CPUs
> processing their callbacks less frequently, and thus getting
> more of them to process at a time.
>
> But it is not clear to me that anyone is looking for realtime
> response from a 512-CPU machine (yow!!!), so perhaps this
> is not a problem...

Agree on both counts.
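
For anyone reading along without the attachment: the throttle amounts
to roughly the following. The real patch may differ in detail - the
mask value and the jiffies/cpu rotation here are just one way to pick
the subset, and the scheduler_tick() call site is quoted from memory of
2.6.5:

#define RCU_TICK_MASK   0x7     /* ~1 cpu in 8 looks at the RCU state per tick */

/* should this cpu bother checking global RCU state on this tick? */
static inline int rcu_tick_selected(int cpu)
{
        return ((jiffies ^ cpu) & RCU_TICK_MASK) == 0;
}

        /* in scheduler_tick(), instead of checking unconditionally: */
        if (rcu_tick_selected(cpu) && rcu_pending(cpu))
                rcu_check_callbacks(cpu, user_ticks);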


--
Thanks

Jack Steiner (steiner@xxxxxxx) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.