Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

From: Andrea Arcangeli
Date: Wed Dec 05 2018 - 19:31:34 EST


On Wed, Dec 05, 2018 at 02:10:47PM -0800, David Rientjes wrote:
> I must have said this at least six or seven times: fault latency is

In your original regression report in this thread to Linus:

https://lkml.kernel.org/r/alpine.DEB.2.21.1811281504030.231719@xxxxxxxxxxxxxxxxxxxxxxxxx

you said "On a fragmented host, the change itself showed a 13.9%
access latency regression on Haswell and up to 40% allocation latency
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
regression. This is more substantial on Naples and Rome. I also
^^^^^^^^^^
measured similar numbers to this for Haswell."

> secondary to the *access* latency. We want to try hard for MADV_HUGEPAGE
> users to do synchronous compaction and try to make a hugepage available.

I'm glad you said it six or seven times now, because you forgot to
mention in the above email that the "40% allocation/fault latency
regression" you reported above is actually a secondary concern,
because those must be long-lived allocations and we can't yet generate
compound pages for free, after all.

> We really want to be backed by hugepages, but certainly not when the
> access latency becomes 13.9% worse as a result compared to local pages of
> the native page size.

Yes, the only regression you measured that isn't merely a secondary
concern is the 13.9% access latency, caused by the lack of immediate
NUMA locality.

BTW, I never got around to asking, but did you enable NUMA balancing
in your benchmarks? NUMA balancing would fix the access latency very
easily too, so that 13.9% access latency should quickly disappear if
you have NUMA balancing correctly enabled on a NUMA system.

Furthermore, NUMA balancing is guaranteed to fully converge if the
workload fits in a single node (your case, or __GFP_THISNODE would
hardly fly in the first place). It'll work even better for you because
you copy all MAP_PRIVATE binaries into MAP_ANON memory to make them
THP backed, so they can also be replicated per node and won't increase
NUMA balancing false sharing. And khugepaged always remains NUMA
agnostic, so it won't risk stepping on NUMA balancing's toes no matter
how we tweak the MADV_HUGEPAGE behavior.

> This is not a system-wide configuration detail, it is specific to the
> workload: does it span more than one node or not? No workload that can
> fit into a single node, which you also say is going to be the majority of
> workloads on today's platforms, is going to want to revert __GFP_THISNODE
> behavior of the past almost four years. It perfectly makes sense,
> however, to be a new mempolicy mode, a new madvise mode, or a prctl.

qemu has been using MADV_HUGEPAGE since the below commit in Oct 2012.