Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64

From: Zi Yan
Date: Mon Oct 05 2020 - 11:34:20 EST


On 2 Oct 2020, at 3:50, David Hildenbrand wrote:

>>>> - huge page sizes controllable by the userspace?
>>>
>>> It might be good to allow advanced users to choose the page sizes, so they
>>> have better control of their applications.
>>
>> Could you elaborate more? Those advanced users can use hugetlb, right?
>> They get a very good control over page size and pool preallocation etc.
>> So they can get what they need - assuming there is enough memory.
>>
>
> I am still not convinced that 1G THP (TGP :) ) are really what we want
> to support. I can understand that there are some use cases that might
> benefit from it, especially:
>
> "I want a lot of memory, give me memory in any granularity you have, I
> absolutely don't care - but of course, more TGP might be good for
> performance." Say, you want a 5GB region, but only have a single 1GB
> hugepage lying around. hugetlbfs allocation will fail.
>
>
> But then, do we really want to optimize for such (very special?) use
> cases via " 58 files changed, 2396 insertions(+), 460 deletions(-)" ?

I am planning to refactor my code further to reduce its size and make
it general enough to support THPs of any size. As Matthew’s patchset[1]
is removing the kernel’s THP size assumption, it might be a good time to
make THP support more general.

>
> I think gigantic pages are a sparse resource. Only selected applications
> *really* depend on them and benefit from them. Let these special
> applications handle it explicitly.
>
> Can we have a summary of use cases that would really benefit from this
> change?

For large machine learning applications, 1GB pages give a good
performance boost[2]. An NVIDIA DGX A100 box now has 1TB of memory,
which means 1GB pages are not that sparse in GPU-equipped
infrastructure[3].

In addition, @Roman Gushchin should be able to provide a more concrete
story from his side.


[1] https://lore.kernel.org/linux-mm/20200908195539.25896-1-willy@xxxxxxxxxxxxx/
[2] http://learningsys.org/neurips19/assets/papers/18_CameraReadySubmission_MLSys_NeurIPS_2019.pdf
[3] https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-dgx-a100-datasheet.pdf


Best Regards,
Yan Zi
