Re: [Patch] numa:x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive -V2.

From: Robin Holt
Date: Thu Aug 19 2010 - 16:43:12 EST



I hope my tone in this email comes across as measured as it feels to
me. I am not completely confident that I have not missed something
which may have had an effect, but I cannot see what that would be.
I believe I am correct, but am willing to be shown to be wrong.

On Thu, Aug 19, 2010 at 07:30:13PM +0200, Roedel, Joerg wrote:
> On Wed, Aug 18, 2010 at 02:30:24PM -0400, Robin Holt wrote:
> > When I investigated, I noticed that all the zone_data[] structures are
> > allocated precisely at the beginning of the individual node's physical
> > memory. By simply staggering based upon nodeid, I reduced the average
> > down to 0.0006% of every second.
>
> Interesting. Have you measured cache misses for both cases?

Not directly. The change made such a compelling difference, and did so
in such a way, as to strongly suggest cache aliasing.
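
For what it is worth, the staggering itself amounts to nothing more
than offsetting each node's allocation by nodeid cache lines. A rough
sketch of the idea (not the actual patch; node_alloc_early() here is a
made-up stand-in for whatever early allocator hands back the start of
a node's memory):

/*
 * Sketch of the staggering idea only, not the actual patch.
 * node_alloc_early() is a hypothetical stand-in for the early
 * allocator that returns memory at the start of node 'nid'.
 */
static void * __init node_data_alloc_staggered(int nid, unsigned long size)
{
	/*
	 * One cache line of stagger per node id, so the zone
	 * structures on different nodes stop mapping onto the
	 * same L3 sets.
	 */
	unsigned long stride = nid * SMP_CACHE_BYTES;
	void *base = node_alloc_early(nid, size + stride);

	return base ? base + stride : NULL;
}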

As a result of your question, I did some back-of-the-envelope
calculations. The shared L3 is an 18MB, 24-way cache, which works out
to 12,288 sets. The sizeof(struct zone) is 1792 bytes, or 28 cache
lines, so a single zone occupies 28 sets. Without the stride, every
node's zone aliases onto the same 28 sets, each way holds one node's
zone, and we exhaust the L3 at 24 nodes. With the single-cache-line
stride, that increases to 28 nodes per set across the 24 ways, or 672
nodes, before we exhaust the L3 cache.
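
If anyone wants to poke at the arithmetic, here it is as a trivial
userspace program using the same numbers (a 64-byte cache line is
assumed, which is what 1792 bytes / 28 lines implies):

#include <stdio.h>

int main(void)
{
	const unsigned long line      = 64;		/* bytes per cache line */
	const unsigned long ways      = 24;		/* L3 associativity */
	const unsigned long l3_bytes  = 18UL << 20;	/* 18MB shared L3 */
	const unsigned long zone_size = 1792;		/* sizeof(struct zone) */

	unsigned long sets       = l3_bytes / (ways * line);	/* 12288 sets */
	unsigned long zone_lines = zone_size / line;		/* 28 lines per zone */

	/* No stride: every node's zone aliases to the same 28 sets. */
	unsigned long nodes_no_stride = ways;			/* 24 nodes */

	/* Single-line stride: 28 nodes per set times 24 ways. */
	unsigned long nodes_strided = zone_lines * ways;	/* 672 nodes */

	printf("sets=%lu lines/zone=%lu no-stride=%lu strided=%lu nodes\n",
	       sets, zone_lines, nodes_no_stride, nodes_strided);
	return 0;
}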

This seems like a plausible explanation for the problem, and I do not
see any benefit to further testing unless you are concerned that I have
done something significantly wrong in the calculation above, or you can
see some assumption I am making which is wrong.

> > With this patch, the max value did not change. I believe that is a
> > combination of cacheline contention updating the zone's vmstat information
> > combined with round_jiffies_common spattering unrelated cpus onto the same
> > jiffy for their next update. I will investigate those issues separately.
>
> The max value is probably the case where none of the data the code
> accesses is actually in the cache. Cache aliasing does not help then, so
> I would not expect a change in the maximum amount here.

The only reason I am looking at the other possible causes is a very
weak data point I have. The trace code I was using to narrow down the
slow area within refresh_cpu_vm_stats captured a couple of data points
where a single cpu had discovered a node with pages present, and the
scan of the per-cpu values plus the addition of those values back into
the node_data's zone took a very long time for the two values being
transferred in each instance. That seems more likely to have been
separate sockets contending for exclusive access to the same cacheline
in the node_data's zone than a cache refill issue.
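
To put the suspected contention point in context, the fold-back in
refresh_cpu_vm_stats() has roughly this shape (a paraphrase, not the
actual mm/vmstat.c code; this_cpu_pageset() stands in for the real
per-cpu lookup and the irq disabling is elided):

	struct zone *zone;
	int i, v;

	for_each_populated_zone(zone) {
		/* this_cpu_pageset() stands in for the real per-cpu lookup */
		struct per_cpu_pageset *p = this_cpu_pageset(zone);

		for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
			if (!p->vm_stat_diff[i])
				continue;

			/* Cheap and local: read and clear this cpu's delta. */
			v = p->vm_stat_diff[i];
			p->vm_stat_diff[i] = 0;

			/*
			 * Not so cheap: this pulls the zone's vm_stat
			 * cacheline into this socket in exclusive state,
			 * which is where I suspect the cross-socket
			 * contention on the node_data zones shows up.
			 */
			atomic_long_add(v, &zone->vm_stat[i]);
		}
	}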

I do not disagree that fully refilling all of the node_data zone
information would take time, but I would expect that to be a very
infrequent occurrence, since round_jiffies_common() ends up scheduling
one thread on each socket to do this calculation on each tick.

> > I had no idea whether to ask stable@xxxxxxxxxx to pull this back to the
> > stable releases. My reading of the stable_kernel_rules.txt criteria is
> > only fuzzy as to whether this meets the "oh, that's not good" standard.
> > I personally think this meets that criteria, but I am unwilling to defend
> > that position too stridently. In the end, I punted and added them to
> > the Cc list. We will be asking both SuSE and RedHat to add this to
> > their upcoming update releases as we expect it to affect their customers.
>
> I don't think this is stable material. It improves performance and does
> not fix a bug. But I am not the one to decide this :-)

The only reason I think it qualifies is that we are talking about 0.8%
of each cpu's time. That means that on the 4096 cpu system we are
dedicating the equivalent of 32 cpus to nothing but vmstat_update. That
feels like it meets the "oh, that's not good" standard to me. Weigh the
relatively minor chance of negative impact against the known positive
impact and it becomes slightly more compelling.

Thanks,
Robin