Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

From: Michal Hocko
Date: Tue Dec 04 2018 - 03:48:28 EST


On Mon 03-12-18 13:53:21, David Rientjes wrote:
> On Mon, 3 Dec 2018, Michal Hocko wrote:
>
> > > I think extending functionality so THP can be allocated remotely, if
> > > truly desired, is worthwhile
> >
> > This is a complete antipattern to the NUMA policy we have for all
> > other user memory allocations. So far you have had to be explicit
> > about your NUMA requirements. You are trying to conflate the NUMA API
> > with MADV, mixing two orthogonal things, and that is just wrong.
> >
>
> No, the page allocator change in both my patch and __GFP_COMPACT_ONLY has
> nothing to do with any madvise() mode. It has to do with where THP
> allocations are preferred. Yes, this is different from other memory
> allocations, which do not cause a 13.9% access latency regression for
> the lifetime of a binary for users who back their text with hugepages.
> MADV_HUGEPAGE still has its purpose: to try synchronous memory compaction
> at fault time under all THP defrag modes other than "never". The specific
> problem being reported here, which both my patch and __GFP_COMPACT_ONLY
> address, is the pointless reclaim activity that does not assist in making
> compaction more successful.
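
For reference, MADV_HUGEPAGE is purely a per-mapping hint; whether the
fault path then compacts synchronously is governed by the global
/sys/kernel/mm/transparent_hugepage/defrag mode. A minimal userspace
sketch, illustrative only (the region size and the single write are
arbitrary):

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 2UL << 20; /* one 2MB huge page worth of VA space */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Hint only: "please back this range with huge pages".
         * It says nothing about NUMA placement. */
        if (madvise(p, len, MADV_HUGEPAGE))
                perror("madvise(MADV_HUGEPAGE)");

        /* The first write faults the range; with defrag != "never"
         * the kernel may compact synchronously here to find a
         * contiguous huge page. */
        ((char *)p)[0] = 1;
        return 0;
}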

You do not address my concern though. Sure, there are reclaim-related
issues; nobody is questioning that. But that is only half of the
problem.

What I am really getting at here is that the reintroduction of
__GFP_THISNODE, which you are pushing for, will conflate the madvise
mode (and likewise defrag=always) with a NUMA placement policy, because
the allocation doesn't fall back to a remote node.
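
To make the mechanism concrete: __GFP_THISNODE steers the allocator to
a zonelist containing only the local node's zones, so there is simply
no remote entry to fall back to. A self-contained toy model of that
selection follows; the real kernel's gfp_zonelist() does essentially
this, but the flag bit and the names below are simplified placeholders:

#include <stdio.h>

#define __GFP_THISNODE (1u << 0)        /* placeholder bit for this sketch */

enum { ZONELIST_FALLBACK, ZONELIST_NOFALLBACK };

/* With __GFP_THISNODE the allocation walks the no-fallback zonelist
 * and can never spill over to another node. */
static int gfp_zonelist(unsigned int flags)
{
        return (flags & __GFP_THISNODE) ? ZONELIST_NOFALLBACK
                                        : ZONELIST_FALLBACK;
}

int main(void)
{
        printf("default:        %s\n",
               gfp_zonelist(0) == ZONELIST_FALLBACK
               ? "local preferred, remote fallback allowed"
               : "local only");
        printf("__GFP_THISNODE: %s\n",
               gfp_zonelist(__GFP_THISNODE) == ZONELIST_NOFALLBACK
               ? "local node only, no fallback"
               : "remote fallback allowed");
        return 0;
}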

And that is the fundamental problem and the antipattern I am talking
about. Look at it this way: all normal allocations utilize all the
available memory, even though they might hit a remote latency penalty.
If you do care about NUMA placement, you have an API to enforce a
specific placement. What is so different about THP that it should
behave differently? Do we really want to later invent an API to
actually allow utilizing all the memory? There are certainly use cases
(the ones that triggered this discussion previously) that do not mind
the remote latency because all the other benefits simply outweigh it.
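
The explicit API I mean is the mempolicy family, mbind(2) and
set_mempolicy(2). A minimal sketch of binding a mapping to a single
node; node 0 and the region size are example values, and the numaif.h
wrapper needs -lnuma:

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <numaif.h>     /* mbind(); link with -lnuma */

int main(void)
{
        size_t len = 2UL << 20;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Explicit NUMA placement: restrict this range to node 0.
         * Orthogonal to, and composable with, an MADV_HUGEPAGE hint
         * on the same range. */
        unsigned long nodemask = 1UL << 0;
        if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0))
                perror("mbind(MPOL_BIND)");

        return 0;
}

The point is that with mempolicies the placement constraint is opt-in,
which is the inverse of making MADV_HUGEPAGE imply it.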

That being said, what should users who want to utilize all the memory
do to get as many THPs as possible?
--
Michal Hocko
SUSE Labs