Re: numa/core regressions fixed - more testers wanted

From: Andrew Theurer
Date: Tue Nov 20 2012 - 20:54:11 EST

Next message: Alex Courbot: "Re: [PATCHv9 1/3] Runtime Interpreted Power Sequences"
Previous message: Chuansheng Liu: "[PATCH] watchdog: using u64 in get_sample_period()"
In reply to: Ingo Molnar: "numa/core regressions fixed - more testers wanted"
Next in thread: Rik van Riel: "Re: numa/core regressions fixed - more testers wanted"

On Tue, 2012-11-20 at 18:56 +0100, Ingo Molnar wrote:
> * Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
> > ( The 4x JVM regression is still an open bug I think - I'll
> > re-check and fix that one next, no need to re-report it,
> > I'm on it. )
>
> So I tested this on !THP too and the combined numbers are now:
>
>                                           |
>   [ SPECjbb multi-4x8 ]                   |
>   [ tx/sec            ]            v3.7   |  numa/core-v16
>   [ higher is better  ]           -----   |  -------------
>                                           |
>               +THP:                639k   |       655k           +2.5%
>               -THP:                510k   |       517k           +1.3%
>
> So it's not a regression anymore, regardless of whether THP is
> enabled or disabled.
>
> The current updated table of performance results is:
>
> -------------------------------------------------------------------------
>   [ seconds         ]    v3.7  AutoNUMA   |  numa/core-v16  [ vs. v3.7]
>   [ lower is better ]   -----  --------   |  -------------  -----------
>                                           |
>   numa01                340.3    192.3    |      139.4         +144.1%
>   numa01_THREAD_ALLOC   425.1    135.1    |      121.1         +251.0%
>   numa02                 56.1     25.3    |       17.5         +220.5%
>                                           |
>   [ SPECjbb transactions/sec ]            |
>   [ higher is better         ]            |
>                                           |
>   SPECjbb 1x32 +THP      524k     507k    |       638k          +21.7%
>   SPECjbb 1x32 !THP      395k             |       512k          +29.6%
>                                           |
> -----------------------------------------------------------------------
>                                           |
>   [ SPECjbb multi-4x8 ]                   |
>   [ tx/sec            ]            v3.7   |  numa/core-v16
>   [ higher is better  ]           -----   |  -------------
>                                           |
>               +THP:                639k   |       655k           +2.5%
>               -THP:                510k   |       517k           +1.3%
>
> So I think I've addressed all regressions reported so far - if
> anyone can still see something odd, please let me know so I can
> reproduce and fix it ASAP.

I can confirm single JVM JBB is working well for me. I see a 30%
improvement over autoNUMA. What I can't make sense of are some perf
stats (taken at 80 warehouses on 4 x WST-EX, 512GB memory):

tip's numa/core:

5,429,632,865 node-loads
3,806,419,082 node-load-misses(70.1%)
2,486,756,884 node-stores
2,042,557,277 node-store-misses(82.1%)
2,878,655,372 node-prefetches
2,201,441,900 node-prefetch-misses

autoNUMA:

4,538,975,144 node-loads
2,666,374,830 node-load-misses(58.7%)
2,148,950,354 node-stores
1,682,942,931 node-store-misses(78.3%)
2,191,139,475 node-prefetches
1,633,752,109 node-prefetch-misses

The percentage of misses is higher for numa/core. I would have expected
the performance increase to be due to fewer "node-misses", but perhaps I
am misinterpreting the perf data.
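
For reference, the percentages above are the miss count divided by the
matching access count. A minimal Python sketch of that arithmetic,
assuming this is the right way to read these counters (the dictionary
layout and the miss_ratios name are only for illustration; the counts
are copied from the perf output above):

```python
# Counts copied from the single-JVM SPECjbb perf output above.
COUNTS = {
    "numa/core": {
        "node-loads": 5_429_632_865,
        "node-load-misses": 3_806_419_082,
        "node-stores": 2_486_756_884,
        "node-store-misses": 2_042_557_277,
        "node-prefetches": 2_878_655_372,
        "node-prefetch-misses": 2_201_441_900,
    },
    "autoNUMA": {
        "node-loads": 4_538_975_144,
        "node-load-misses": 2_666_374_830,
        "node-stores": 2_148_950_354,
        "node-store-misses": 1_682_942_931,
        "node-prefetches": 2_191_139_475,
        "node-prefetch-misses": 1_633_752_109,
    },
}

# (access event, miss event) pairs; assumption: misses are a subset
# of the matching access count.
PAIRS = [
    ("node-loads", "node-load-misses"),
    ("node-stores", "node-store-misses"),
    ("node-prefetches", "node-prefetch-misses"),
]


def miss_ratios(counts):
    """Return {miss event: percentage of the matching access count}."""
    return {miss: 100.0 * counts[miss] / counts[total] for total, miss in PAIRS}


for kernel, counts in COUNTS.items():
    print(kernel)
    for event, pct in miss_ratios(counts).items():
        print(f"  {event:22s} {pct:5.1f}%")
```

If the hardware events actually count local and remote accesses
separately rather than "all" and "missed", the ratio would need a
different denominator, which is one possible source of the confusion.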

One other thing I noticed is that neither test is even using all of the
CPU (75-80%), so I suspect there's a JVM scalability issue with this
workload at this number of CPU threads (80). This is an IBM JVM, so
there may be some differences. I am curious whether any of the others
testing JBB are getting 100% CPU utilization at their warehouse peak.

So, while the performance results are encouraging, I would like to
correlate them with some kind of perf data that confirms why we think
numa/core is better.

>
> Next I'll work on making multi-JVM more of an improvement, and
> I'll also address any incoming regression reports.

I have issues with multiple KVM VMs running either JBB or
dbench-in-tmpfs, and I suspect whatever I am seeing is similar to
whatever multi-JVM on bare metal is seeing. What I typically see is no
real convergence on a single node for resource usage for any of the VMs.
For example, when running 8 VMs, 10 vCPUs each, a VM may have the
following resource usage:

host cpu usage from cpuacct cgroup:
/cgroup/cpuacct/libvirt/qemu/at-vm01

node00              node01              node02              node03
199056918180|005%   752455339099|020%   1811704146176|049%  888803723722|024%

And VM memory placement in host (in pages):

node00        node01        node02        node03
107566|023%   115245|025%   117807|025%   119414|025%

Conversely, autoNUMA usually has 98+% for cpu and memory in one of the
host nodes for each of these VMs. AutoNUMA is about 30% better in these
tests.
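
To make "converged" concrete, here is a small Python sketch that turns
the per-node numbers above into shares and checks whether one node
dominates. The helper names and the 98% threshold (taken from the
autoNUMA figure mentioned above) are assumptions for illustration, not
part of any tool I am running:

```python
# Per-node values for at-vm01, copied from the tables above
# (node00..node03): host CPU usage from cpuacct and memory in pages.
cpu_usage = [199056918180, 752455339099, 1811704146176, 888803723722]
mem_pages = [107566, 115245, 117807, 119414]


def node_shares(values):
    """Return each node's share of the total, in percent."""
    total = sum(values)
    return [100.0 * v / total for v in values]


def dominant_node(values):
    """Return (node index, share in percent) of the busiest node."""
    shares = node_shares(values)
    idx = max(range(len(shares)), key=shares.__getitem__)
    return idx, shares[idx]


def is_converged(values, threshold=98.0):
    """Hypothetical check: one node holds at least `threshold` percent."""
    return dominant_node(values)[1] >= threshold


for label, values in (("cpu", cpu_usage), ("memory", mem_pages)):
    idx, share = dominant_node(values)
    print(f"{label}: node{idx:02d} holds {share:.1f}%, "
          f"converged={is_converged(values)}")
```

The percentages in the tables appear to be these shares truncated to
whole numbers.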

That is data for the entire run time, and "not converged" could possibly
mean "converged but moved around", but I doubt that's what's happening.

Here's perf data for the dbench VMs:

numa/core:

468,634,508 node-loads
210,598,643 node-load-misses(44.9%)
172,735,053 node-stores
107,535,553 node-store-misses(62.3%)
208,064,103 node-prefetches
160,858,933 node-prefetch-misses

autoNUMA:

666,498,425 node-loads
222,643,141 node-load-misses(33.4%)
219,003,566 node-stores
99,243,370 node-store-misses(45.3%)
315,439,315 node-prefetches
254,888,403 node-prefetch-misses

These seem to make a little more sense to me, but the percentages for
autoNUMA still seem a little high (but at least lower than numa/core).
I need to take a manually pinned measurement to compare.

> Those of you who would like to test all the latest patches are
> welcome to pick up latest bits at tip:master:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master

I've been running on numa/core, but I'll switch to master and try these
again.

Thanks,

-Andrew Theurer
