Re: [PATCH 1/8] THP: Use real address for NUMA policy

From: Alex Thorlton
Date: Tue Aug 27 2013 - 12:50:45 EST

Next message: Thomas Petazzoni: "Re: [RFC v1 0/5] ARM: Initial support for Marvell Armada 1500"
Previous message: jbaron: "[PATCH] dynamic debug: line queries failing due to uninitialized local variable"
Next in thread: Robin Holt: "Re: [PATCH 1/8] THP: Use real address for NUMA policy"

> Here's more up-to-date version: https://lkml.org/lkml/2012/8/20/337

These don't seem to give us a noticeable performance change either:

With THP:

real 22m34.279s
user 10797m35.984s
sys 39m18.188s

Without THP:

real 4m48.957s
user 2118m23.208s
sys 113m12.740s

Looks like the THP case got a few minutes faster, but that could just be
a fluke result, and it's still significantly slower; we're still
floating at about a 5x performance degradation.
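
For reference, the slowdown can be worked out directly from the two
sets of timings above. This is just a quick sketch; the helper name
to_seconds and the dictionary layout are illustrative and not part of
any benchmark tooling.

```python
def to_seconds(stamp: str) -> float:
    """Convert a time(1)-style value such as '22m34.279s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# Timings copied from the two runs quoted above.
with_thp = {"real": "22m34.279s", "user": "10797m35.984s", "sys": "39m18.188s"}
without_thp = {"real": "4m48.957s", "user": "2118m23.208s", "sys": "113m12.740s"}

# Ratio > 1 means the THP run spent more time in that category.
ratios = {
    key: to_seconds(with_thp[key]) / to_seconds(without_thp[key])
    for key in with_thp
}

for key, ratio in ratios.items():
    print(f"{key}: {ratio:.2f}x")
```

Wall-clock time comes out near 4.7x and user time near 5.1x, while sys
time is actually lower in the THP run.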

I talked with one of our performance/benchmarking experts last week and
he's done a bit more research into the actual problem here, so I've got
a bit more information:

The real performance hit, based on our testing, seems to be coming from
the increased latency that comes into play on large NUMA systems when a
process has to go off-node to read from/write to memory.

To give an extreme example, say we have a 16 node system with 8 cores
per node. If we have a job that shares a 2MB data structure between 128
threads, with THP on, the first thread to touch the structure will
allocate all 2MB of space for that structure in a 2MB page, local to its
socket. This means that all the memory accesses for the other 120
threads will be remote accesses. With THP off, each thread could locally
allocate a number of 4K pages sufficient to hold the chunk of the
structure on which it needs to work, significantly reducing the number
of remote accesses that each thread will need to perform.
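
The example above can be sketched as a toy first-touch model. The
assumptions here are mine and are only for illustration: thread i is
pinned to node i // 8, each thread works on its own equal, contiguous
slice of the structure, and a page lands on the node of whichever
thread touches it first. Real placement depends on the memory policy
and on the actual access pattern.

```python
NODES = 16
CORES_PER_NODE = 8
THREADS = NODES * CORES_PER_NODE        # 128 threads, one per core
STRUCT_BYTES = 2 * 1024 * 1024          # the shared 2MB structure
HUGE_PAGE = 2 * 1024 * 1024             # THP on: one 2MB page
BASE_PAGE = 4 * 1024                    # THP off: 4K pages


def node_of(thread: int) -> int:
    """Assumed pinning: threads fill nodes in order, 8 per node."""
    return thread // CORES_PER_NODE


def pages_of(thread: int, page_size: int) -> range:
    """Page indices covering the slice this thread works on (assumed layout)."""
    slice_bytes = STRUCT_BYTES // THREADS
    start = thread * slice_bytes
    end = start + slice_bytes
    return range(start // page_size, (end - 1) // page_size + 1)


def remote_threads(page_size: int) -> int:
    """Count threads whose slice lives on another node after first touch."""
    page_home = {}
    # First-touch pass: thread 0 runs first, so it owns any page it shares.
    for thread in range(THREADS):
        for page in pages_of(thread, page_size):
            page_home.setdefault(page, node_of(thread))
    return sum(
        any(page_home[page] != node_of(thread)
            for page in pages_of(thread, page_size))
        for thread in range(THREADS)
    )


print("THP on :", remote_threads(HUGE_PAGE), "threads access remote memory")
print("THP off:", remote_threads(BASE_PAGE), "threads access remote memory")
```

Under these assumptions the 2MB page ends up on node 0, leaving 120 of
the 128 threads remote, while with 4K pages each thread's 16K slice
(four pages) is allocated on its own node.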

So, with that in mind, do we agree that a per-process tunable (or
something similar) to control THP seems like a reasonable method to
handle this issue?

Just want to confirm that everyone likes this approach before moving
forward with another revision of the patch. I'm currently in favor of
moving this to a per-mm tunable, since that seems to make more sense
when it comes to threaded jobs. Also, a decent chunk of the code I've
already written can be reused with this approach, and prctl will still
be an appropriate place from which to control the behavior. Andrew
Morton suggested possibly controlling this through the ELF header, but
I'm going to lean towards the per-mm route unless anyone has a major
objection to it.

- Alex