Re: [PATCH 1/8] THP: Use real address for NUMA policy

From: Alex Thorlton
Date: Mon Sep 09 2013 - 12:48:22 EST


On Thu, Sep 05, 2013 at 01:15:10PM +0200, Ingo Molnar wrote:
>
> * Alex Thorlton <athorlton@xxxxxxx> wrote:
>
> > > Robin,
> > >
> > > I tweaked one of our other tests to behave pretty much exactly as I
> > > described:
> > > - malloc a large array
> > > - Spawn a specified number of threads
> > > - Have each thread touch small, evenly spaced chunks of the array (e.g.
> > > for 128 threads, the array is divided into 128 chunks, and each thread
> > > touches 1/128th of each chunk, dividing the array into 16,384 pieces)
> >
> > Forgot to mention that the threads don't touch their chunks of memory
> > concurrently, i.e. thread 2 has to wait for thread 1 to finish first.
> > This is important to note, since the pages won't all get stuck on the
> > first node without this behavior.
>
> Could you post the testcase please?
>
> Thanks,
>
> Ingo

Sorry for the delay here; I had to make sure that everything in my tests
was okay to push out to the public. Here's a pointer to the test I
wrote:

ftp://shell.sgi.com/collect/appsx_test/pthread_test.tar.gz

Everything needed to compile the test should be there (just run make in
the thp_pthread directory). To run the test, use something like:

time ./thp_pthread -C 0 -m 0 -c <max_cores> -b <memory>

I ran:

time ./thp_pthread -C 0 -m 0 -c 128 -b 128g

on a 256-core machine with ~500GB of memory, and got these results:

THP off:

real 0m57.797s
user 46m22.156s
sys 6m14.220s

THP on:

real 1m36.906s
user 0m2.612s
sys 143m13.764s

I snagged some code from another test we use, so I can't vouch for the
usefulness/accuracy of all the output (actually, I know some of it is
wrong). I've mainly been looking at the total run time.
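
For anyone who doesn't want to pull the tarball, the core of the test is
roughly the access pattern described at the top of the thread. The sketch
below is only an illustration (the sizes, names, and condvar hand-off are
my simplified stand-ins, not the actual thp_pthread source):

/* Rough sketch of the access pattern, NOT the actual thp_pthread code. */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 128
#define BYTES    (128UL << 30)		/* matches -b 128g */

static char *array;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static long turn;			/* which thread may touch next */

static void *toucher(void *arg)
{
	long id = (long)arg;
	size_t chunk = BYTES / NTHREADS;	/* 128 chunks          */
	size_t piece = chunk / NTHREADS;	/* 1/128th of each one */

	/* Thread N waits for thread N-1 to finish before touching. */
	pthread_mutex_lock(&lock);
	while (turn != id)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);

	/* Touch this thread's small slice of every chunk. */
	for (long c = 0; c < NTHREADS; c++)
		memset(array + c * chunk + id * piece, 1, piece);

	pthread_mutex_lock(&lock);
	turn++;
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];

	array = malloc(BYTES);
	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, toucher, (void *)i);
	for (long i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	free(array);
	return 0;
}

The ordered hand-off is the important part; as noted above, without it the
pages don't all end up stuck on the first node.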

I don't want to bloat this e-mail with too many test results, but I
found this one really interesting: same machine, all 256 cores, same
amount of memory. This means that each CPU is actually doing *less*
work, since the chunk we reserve gets divided up evenly amongst the
CPUs (128g across 256 threads is 512m per thread, versus 1g per thread
in the previous run):

time ./thp_pthread -C 0 -m 0 -c 256 -b 128g

THP off:

real 1m1.028s
user 104m58.448s
sys 8m52.908s

THP on:

real 2m26.072s
user 60m39.404s
sys 337m10.072s

It seems that the test scales really well in the THP-off case, but, once
again, with THP on we see the performance really start to degrade.

I'm planning to start investigating possible ways to split up THPs if
we detect that the majority of the references to a THP are off-node.
I've heard some horror stories about migrating pages in this situation
(i.e., the process switches CPUs and then all of its pages follow it),
but I think we might be able to get better results if we can cleverly
determine an appropriate time to split up pages. I've heard a bit of
talk about doing something similar to this from a few people, but I
haven't seen any code or test results. If anybody has any input on this
topic, it would be greatly appreciated.
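
For what it's worth, one way to double-check where the pages of a buffer
actually landed is to query placement with move_pages(2) and a NULL node
list, which reports each page's node without migrating anything. This is
just a rough sketch of my own (error handling stripped, build with
-lnuma), not something from the test above:

/* Sketch only: count how many pages of a buffer sit on each NUMA node.
 * move_pages(2) with a NULL node list only queries placement; it does
 * not migrate anything.  Build with -lnuma. */
#include <numa.h>
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void count_nodes(char *buf, size_t bytes)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	unsigned long npages = bytes / pagesz;
	void **pages = malloc(npages * sizeof(*pages));
	int *status = malloc(npages * sizeof(*status));
	int nnodes = numa_max_node() + 1;
	long *count = calloc(nnodes, sizeof(*count));

	for (unsigned long i = 0; i < npages; i++)
		pages[i] = buf + i * pagesz;

	/* nodes == NULL: just report the node of each page in status[]. */
	move_pages(0, npages, pages, NULL, status, 0);

	for (unsigned long i = 0; i < npages; i++)
		if (status[i] >= 0)
			count[status[i]]++;

	for (int n = 0; n < nnodes; n++)
		printf("node %d: %ld pages\n", n, count[n]);

	free(pages);
	free(status);
	free(count);
}

int main(void)
{
	size_t bytes = 256UL << 20;	/* small demo buffer */
	char *buf = malloc(bytes);

	memset(buf, 1, bytes);		/* fault the pages in */
	count_nodes(buf, bytes);
	free(buf);
	return 0;
}

Pointing something like that at the test's big array right after the touch
phase should show whether everything really piled up on the first node.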

- Alex