Re: [RFC PATCH] mm: thp: implement THP reservations for anonymous memory

From: Mel Gorman
Date: Sat Nov 10 2018 - 08:24:41 EST


On Fri, Nov 09, 2018 at 02:51:50PM -0500, Andrea Arcangeli wrote:
> On Fri, Nov 09, 2018 at 03:13:18PM +0300, Kirill A. Shutemov wrote:
> > I haven't yet read the patch in details, but I'm skeptical about the
> > approach in general for few reasons:
> >
> > - PTE page table retracting to replace it with huge PMD entry requires
> > down_write(mmap_sem). It makes the approach not practical for many
> > multi-threaded workloads.
> >
> > I don't see a way to avoid exclusive lock here. I will be glad to
> > be proved otherwise.
> >
> > - The promotion will also require TLB flush which might be prohibitively
> > slow on big machines.
> >
> > - Short living processes will fail to benefit from THP with the policy,
> > even with plenty of free memory in the system: no time to promote to THP
> > or, with synchronous promotion, cost will overweight the benefit.
> >
> > The goal to reduce memory overhead of THP is admirable, but we need to be
> > careful not to kill THP benefit itself. The approach will reduce number of
> > THP mapped in the system and/or shift their allocation to later stage of
> > process lifetime.
> >
> > The only way I see it can be useful is if it will be possible to apply the
> > policy on per-VMA basis. It will be very useful for malloc()
> > implementations, for instance. But as a global policy it's no-go to me.
>
> I'm also skeptical about this: the current design is quite
> intentional. It's not a bug but a feature that we're not doing the
> promotion.
>

Understood. I think with two people with extensive THP experience being
skeptical about this, we should take a step back before Anthony spends
too much more time on this. It would be a shame to work extensively on
a series just to have it rejected.

> Part of the tradeoff with THP is to use more RAM to save CPU, when you
> use less RAM you're inherently already wasting some CPU just for the
> reservation management and you don't get the immediate TLB benefit
> anymore either.
>

This is true; there is a gap where there is no THP benefit. The big
question is how many workloads, if any, suffer as a result of premature
reclaim due to sparse references of the address space consuming too much
memory. Anthony, do you have any benchmarks in mind? I don't because the
HPC workloads I'm aware of are usually sized to fit in memory regardless
of THP use.

> And if you're in the camp that is concerned about the use of more RAM
> or/and about the higher latency of COW faults, I'm afraid the
> intermediate solution will be still slower than the already available
> MADV_NOHUGEPAGE or enabled=madvise.
>

Does that not prevent huge page usage? Maybe you can spell it out a bit
better. What is the set of system calls an application should make to
not use huge pages either for the address space or on a per-VMA basis
and defer to kcompactd? I know that can be tuned globally but that's not
quite the same thing given that multiple applications or containers can
be running with different requirements.
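
For reference, the opt-outs I am aware of can be sketched as below. This is
only an illustration, assuming Linux and Python 3.8+. madvise() with
MADV_NOHUGEPAGE marks a single mapping, and prctl(PR_SET_THP_DISABLE), value
41 in <linux/prctl.h>, covers the whole process. The helper names are made up
for this example. As far as I can tell both calls also stop khugepaged from
collapsing the range later, so neither gives "no THP at fault time, promote in
the background" for one application.

```python
import ctypes
import mmap

PR_SET_THP_DISABLE = 41  # from <linux/prctl.h>


def map_without_thp(length):
    """Create a private anonymous mapping and advise against THP for it.

    Returns (mapping, applied). applied is False when the advice could not
    be set, e.g. on a kernel built without THP support.
    """
    # fileno -1 makes Python request an anonymous mapping.
    m = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE)
    advice = getattr(mmap, "MADV_NOHUGEPAGE", None)
    if advice is None or not hasattr(m, "madvise"):
        return m, False
    try:
        m.madvise(advice)  # per-VMA: applies to this mapping only
    except OSError:
        return m, False
    return m, True


def disable_thp_for_process():
    """Set the per-process THP-disable flag. Returns True on success."""
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        zero = ctypes.c_ulong(0)
        # arg3..arg5 must be zero for PR_SET_THP_DISABLE.
        rc = libc.prctl(PR_SET_THP_DISABLE, ctypes.c_ulong(1), zero, zero, zero)
        return rc == 0
    except (OSError, AttributeError, TypeError):
        return False
```

The flag set by disable_thp_for_process() is inherited across fork and exec,
so a small wrapper could apply it to a whole container payload.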

> Now about the implementation: the whole point of the reservation
> complexity is to skip the khugepaged copy, so it can collapse in
> place. Is skipping the copy worth it? Isn't the big cost the IPI
> anyway to avoid leaving two simultaneous TLB mappings of different
> granularity?
>

Not necessarily. With THP anon in the simple case, it might be just a
single thread and kcompactd, so that's one IPI (kcompactd flushes locally
and sends one IPI to the CPU the thread was running on, assuming it's not
migrating excessively). It would scale up with the number of threads but I
suspect the main cost is the actual copying, page table manipulation and
the locking required.

> So if you are ok to copy the memory that you promote to THP, you'd
> just need a global THP mode to avoid allocating THP even when they're
> available during the page fault (while still allowing khugepaged to
> collapse hugepages in the background), and then reduce max_ptes_none
> to get the desired promotion ratio.
>
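
To make the max_ptes_none knob concrete, here is a rough sketch of the
arithmetic. It assumes 512 PTEs per huge PMD (4K base pages, 2M huge pages)
and that khugepaged collapses a range only when the number of empty PTEs does
not exceed khugepaged/max_ptes_none. The function names are hypothetical and
the model ignores the other collapse conditions (swap, shared pages, VMA
flags).

```python
import math

HPAGE_PMD_NR = 512  # assumed: 4K base pages per 2M huge page


def max_ptes_none_for_ratio(min_populated_ratio, ptes_per_pmd=HPAGE_PMD_NR):
    """Return the max_ptes_none value that requires at least the given
    fraction of PTEs to be populated before a collapse is allowed."""
    if not 0.0 <= min_populated_ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    min_populated = math.ceil(min_populated_ratio * ptes_per_pmd)
    # The sysfs knob tops out at ptes_per_pmd - 1 (the default, 511).
    return min(ptes_per_pmd - 1, ptes_per_pmd - min_populated)


def would_collapse(populated_ptes, max_ptes_none, ptes_per_pmd=HPAGE_PMD_NR):
    """Simplified collapse check: empty PTEs must not exceed the limit."""
    return (ptes_per_pmd - populated_ptes) <= max_ptes_none


if __name__ == "__main__":
    limit = max_ptes_none_for_ratio(0.5)
    print(limit, would_collapse(300, limit), would_collapse(100, limit))
```

With the default of 511 a single populated PTE is enough for a collapse,
which is why lowering the value changes the promotion ratio.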

As an aside, a universal benefit would be looking at reducing the time
to allocate the necessary huge page as we know that can be excessive. It
would be orthogonal to this series.

> > <SNIP>
> >
> > Prove me wrong with performance data. :)
>
> Same here.
>

Could you and Kirill outline what sort of workloads you would consider
acceptable for evaluating this series? One would assume it covers at
least the following, potentially with a number of workloads.

1. Evaluate the collapse and copying costs (probing the entire time
spent in collapse_huge_page might do it)
2. Evaluate mmap_sem hold time during hugepage collapse
3. Estimate excessive RAM use due to unnecessary THP usage
4. Estimate the slowdown due to delayed THP usage

1 and 2 would indicate how much time is lost due to not using
reservations. That potentially goes in the direction of simply making
this faster -- fragmentation reduction (posted but unreviewed), faster
compaction searches, better page isolation during compaction to
avoid free pages being reused before an order-9 is free.

3 should be straightforward but 4 would be the hardest to evaluate
because it would have to be determined if 4 is offset by improvements to
1-3. If 1-3 are improved enough, it might remove the motivation for the
series entirely.
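
One possible starting point for 3, as a sketch only: run the same workload
with and without THP and compare the resident set from /proc/<pid>/smaps.
The helpers below are hypothetical. They sum the Rss and AnonHugePages fields
and report the difference between two runs. AnonHugePages alone is only an
upper bound on potential waste, since it says nothing about how much of each
huge page was actually referenced.

```python
def summarize_smaps(smaps_text):
    """Sum Rss and AnonHugePages (in kB) over all mappings in smaps output."""
    totals = {"Rss": 0, "AnonHugePages": 0}
    for line in smaps_text.splitlines():
        fields = line.split()
        # Field lines look like "Rss:    4096 kB"; mapping headers do not.
        if len(fields) == 3 and fields[2] == "kB":
            name = fields[0].rstrip(":")
            if name in totals:
                totals[name] += int(fields[1])
    return totals


def excess_rss_kb(with_thp, without_thp):
    """RSS growth attributable to THP between two runs of the same workload."""
    return with_thp["Rss"] - without_thp["Rss"]


def read_smaps(pid="self"):
    """Read the smaps file of a process (Linux only)."""
    with open(f"/proc/{pid}/smaps") as f:
        return f.read()
```

Sampling read_smaps() at the same point in both runs keeps the comparison
honest, for example at peak RSS.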

In other words, if we agree on a workload in advance, it might take
this in the right direction and not accidentally throw Anthony down a hole
working on a series that never gets ack'd.

I'm not necessarily the best person to answer because my natural inclination
after the fragmentation series would be to keep using thpfioscale
(from the fragmentation avoidance series) and work on improving the THP
allocation success rates and reducing latencies. I've tunnel vision on that
for the moment.
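
For anyone wanting to track allocation success rates alongside whatever
workload is chosen, a minimal sketch is below. It derives them from the
thp_fault_alloc, thp_fault_fallback, thp_collapse_alloc and
thp_collapse_alloc_failed counters in /proc/vmstat. The function names are
made up for illustration, the counters are system-wide and cumulative since
boot, so deltas across a run are what matter, and allocation latency is not
covered at all.

```python
def parse_vmstat(text):
    """Parse "name value" lines from /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def _rate(success, failure):
    total = success + failure
    return success / total if total else None


def thp_success_rates(before, after):
    """Success rates for fault-time and khugepaged THP allocations,
    computed from the change in counters between two snapshots.
    A rate is None when no attempts were made in the interval."""
    def delta(name):
        return after.get(name, 0) - before.get(name, 0)

    return {
        "fault": _rate(delta("thp_fault_alloc"), delta("thp_fault_fallback")),
        "collapse": _rate(delta("thp_collapse_alloc"),
                          delta("thp_collapse_alloc_failed")),
    }
```

Taking one snapshot of /proc/vmstat before the run and one after, then
passing both through parse_vmstat(), gives the inputs.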

Thanks.

--
Mel Gorman
SUSE Labs