Re: [patch 0/2 for-4.20] mm, thp: fix remote access and allocation regressions

From: David Rientjes
Date: Wed Dec 05 2018 - 14:41:35 EST


On Wed, 5 Dec 2018, Mel Gorman wrote:

> > This is a single MADV_HUGEPAGE usecase, there is nothing special about it.
> > It would be the same as if you did mmap(), madvise(MADV_HUGEPAGE), and
> > faulted the memory with a fragmented local node and then measured the
> > remote access latency to the remote hugepage that occurs without setting
> > __GFP_THISNODE. You can also measure the remote allocation latency by
> > fragmenting the entire system and then faulting.
> >
>
> I'll make the same point as before, the form the fragmentation takes
> matters as well as the types of pages that are resident and whether
> they are active or not. It affects the level of work the system does
> as well as the overall success rate of operations (be it reclaim, THP
> allocation, compaction, whatever). This is why a reproduction case that is
> representative of the problem you're facing on the real workload would
> have been helpful, because then any alternative proposal could have
> taken your workload into account during testing.
>

We know from Andrea's report that compaction is failing, and failing
repeatedly, because otherwise we would not need excessive swapping to make
it work. That can mean one of two things: (1) a general low-on-memory
situation that repeatedly leaves us below the watermarks needed to deem
compaction suitable (isolate_freepages() will be too painful) or (2)
compaction has the memory that it needs but is failing to make a hugepage
available because not all pages in a pageblock can be migrated.
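
One way to observe "compaction is failing repeatedly while we swap" from
userspace is to diff a few /proc/vmstat counters around the workload. The
sketch below is only illustrative: the counter names are real /proc/vmstat
fields, but the helper names (parse_vmstat, vmstat_delta, read_vmstat) are
hypothetical, and the counters alone do not tell (1) and (2) apart.

```python
# Counter names are real /proc/vmstat fields; the helpers are hypothetical
# names used only for this sketch.
COUNTERS = (
    "compact_stall", "compact_fail", "compact_success",
    "thp_fault_alloc", "thp_fault_fallback", "pswpout",
)


def parse_vmstat(text):
    """Return {name: value} for the counters of interest."""
    out = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in COUNTERS:
            out[parts[0]] = int(parts[1])
    return out


def vmstat_delta(before, after):
    """Per-counter difference; counters missing from a snapshot count as 0."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in COUNTERS}


def read_vmstat(path="/proc/vmstat"):
    """Take a snapshot of the live counters (Linux only)."""
    with open(path) as f:
        return parse_vmstat(f.read())
```

Taking read_vmstat() before and after the faulting workload and passing
both snapshots to vmstat_delta() would show compact_fail and pswpout
growing together in the situation described above.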

If (1), perhaps in the presence of an antagonist that is quickly
allocating the memory before compaction can pass watermark checks, further
reclaim is not beneficial: the allocation is becoming too expensive and
there is no guarantee that compaction can find this reclaimed memory in
isolate_freepages().

I chose to reproduce (2) by synthetically introducing fragmentation
locally (allocate high-order slab, then free every other one) to test the
patch that does not set __GFP_THISNODE. The result is a remote transparent
hugepage,
but we do not even need to get to the point of local compaction for that
fallback to happen. And this is where I measure the 13.9% access latency
regression for the lifetime of the binary as a result of this patch.
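
For anyone wanting to check where the hugepage ends up, the userspace side
of this test (mmap(), madvise(MADV_HUGEPAGE), fault) can be sketched as
below. This is not the program used for the measurement above; it is a
minimal illustration assuming Linux and Python 3.8+, and the function names
are hypothetical. It does not fragment the local node or measure latency;
it only faults the region and sums the per-node page counts that
/proc/self/numa_maps reports for anonymous mappings of the whole process.

```python
import mmap
import re

PAGE = mmap.PAGESIZE


def fault_thp_region(length=8 * 1024 * 1024):
    """Map private anonymous memory, advise THP, and touch every base page."""
    # fileno -1 makes the mapping anonymous; MAP_PRIVATE avoids shmem.
    region = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE)
    advice = getattr(mmap, "MADV_HUGEPAGE", None)
    if advice is not None:
        try:
            region.madvise(advice)
        except OSError:
            pass  # THP unavailable here: fall back to base pages
    for off in range(0, length, PAGE):
        region[off] = 1  # the write fault is what triggers the allocation
    return region


def anon_pages_per_node(numa_maps_text):
    """Sum the N<node>=<pages> fields over all anonymous mappings."""
    totals = {}
    for line in numa_maps_text.splitlines():
        if " anon=" not in line:
            continue
        for node, pages in re.findall(r"\bN(\d+)=(\d+)", line):
            totals[int(node)] = totals.get(int(node), 0) + int(pages)
    return totals


def current_placement(path="/proc/self/numa_maps"):
    """Read the live per-node placement of this process (Linux only)."""
    with open(path) as f:
        return anon_pages_per_node(f.read())
```

Calling current_placement() before and after fault_thp_region() and
comparing the two results shows which node the new pages were charged to.
With a fragmented local node and no __GFP_THISNODE, the growth would be
expected on a remote node.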

If local compaction works the first time, great! But that is not what is
happening in Andrea's report and as a result of not setting __GFP_THISNODE
we are *guaranteed* worse access latency and may encounter even worse
allocation latency if the remote memory is fragmented as well.

So I am only testing the functional behavior of the patch itself; I cannot
speak to the nature of the local fragmentation on Andrea's systems.