Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64

From: David Hildenbrand
Date: Mon Oct 05 2020 - 14:48:19 EST


> The real control of hugetlbfs comes from the interfaces provided by
> the kernel. If kernel provides similar interfaces to control page sizes
> of THPs, it should work the same as hugetlbfs. Mixing page sizes usually
> comes from system memory fragmentation and hugetlbfs does not have this
> mixture because of its special allocation pools not because of the code

With hugetlbfs, you have a guarantee that all pages within your VMA have
the same page size. This is an important property. With THP, you have the
guarantee that any page can be operated on as if it were at base-page
granularity.

Example: KVM on s390x

a) It cannot deal with THP. If you supply THP, the kernel will simply
split up all THP and prohibit new ones from getting formed. All works
well (well, no speedup because no THP).
b) It can deal with 1MB huge pages (in some configurations).
c) It cannot deal with 2G huge pages.

So user space really has to control which page size to use in the case
of hugetlbfs.
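
For reference, a minimal sketch (mine, just for illustration) of how
user space already picks the page size explicitly with hugetlbfs,
assuming 1 GiB pages have been reserved in the pool:

  #include <sys/mman.h>
  #include <linux/mman.h>  /* MAP_HUGE_2MB, MAP_HUGE_1GB */
  #include <stdio.h>

  int main(void)
  {
          size_t len = 1UL << 30;  /* one 1 GiB page */
          void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
                         MAP_HUGE_1GB, -1, 0);

          if (p == MAP_FAILED) {
                  /* fails unless 1 GiB pages are reserved, e.g., via
                   * hugepages-1048576kB/nr_hugepages in sysfs */
                  perror("mmap");
                  return 1;
          }
          /* every page within this VMA is guaranteed to be 1 GiB */
          munmap(p, len);
          return 0;
  }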

> itself. If THPs are allocated from the same pools, they would act
> the same as hugetlbfs. What am I missing here?

Did I mention that I dislike taking THP from the CMA pool? ;)

>
> I just do not get why hugetlbfs is so special that it can have pagesize
> fine control when normal pages cannot get. The "it should be invisible
> to userspace" argument suddenly does not hold for hugetlbfs.

It's not about "cannot get", it's about "do we need it". We do have a
trigger "THP yes/no". I wonder in which cases that wouldn't be sufficient.


The name "Transparent" implies that they *should* be transparent to user
space. This, unfortunately, is not completely true:

1. Performance aspects: Breaking up THP is bad for performance. This can
be observed fairly easily when using 4k-based memory ballooning in
virtualized environments. If we stick to the current THP size (e.g.,
2MB), we are mostly fine. Breaking up 1G THP into 2MB THP when required
is completely acceptable.

2. Wasting memory: Touch a 4K page, get 2M populated. Somewhat
acceptable / controllable. Touching 4K and getting 1G populated is not
desirable. And I think we mostly agree that we should only replace
fully-populated ranges with 1G THP.
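
The 2M case is easy to observe; a quick sketch (mine, assuming THP is
enabled for the range, i.e., transparent_hugepage=always or =madvise):

  #include <sys/mman.h>
  #include <stdint.h>
  #include <stdio.h>

  #define HPAGE_2M (2UL << 20)

  int main(void)
  {
          /* over-allocate so we can carve out a 2 MiB aligned range */
          char *raw = mmap(NULL, 2 * HPAGE_2M, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          char *p;

          if (raw == MAP_FAILED)
                  return 1;
          p = (char *)(((uintptr_t)raw + HPAGE_2M - 1) &
                       ~(HPAGE_2M - 1));

          madvise(p, HPAGE_2M, MADV_HUGEPAGE);
          p[0] = 1;  /* touch a single 4K page ... */

          /* ... and, if the fault could allocate a huge page,
           * AnonHugePages in /proc/self/smaps grows by 2048 kB */
          getchar();
          return 0;
  }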


But then, there is no observable difference between 1G THP and 2M THP
from a user-space point of view except performance.

So we are debating "should the kernel tell us that we can use 1G THP
for a VMA?". What if we were suddenly to support 2G THP (look at how
arm64 supports all kinds of huge page sizes for hugetlbfs)? Do we
really need *another* trigger?

What Michal proposed (IIUC) is rather that user space tells the kernel
"this large memory range here is *really* important for performance,
please try to optimize the memory layout; give me the best you've got".

MADV_HUGEPAGE_1GB is just ugly.
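
To spell the contrast out (MADV_HUGEPAGE_1GB being the flag proposed in
this series, not an existing interface, so this fragment intentionally
does not compile upstream):

  /* per-THP-size flag, as proposed: one new flag per supported size */
  madvise(addr, len, MADV_HUGEPAGE_1GB);

  /* size-agnostic hint, as it exists today: the kernel is free to use
   * whatever page size it can actually form for the range */
  madvise(addr, len, MADV_HUGEPAGE);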


--
Thanks,

David / dhildenb