Re: [RFC PATCH 00/16] 1GB THP support on x86_64

From: Michal Hocko
Date: Fri Sep 04 2020 - 03:42:17 EST


On Thu 03-09-20 09:25:27, Roman Gushchin wrote:
> On Thu, Sep 03, 2020 at 09:32:54AM +0200, Michal Hocko wrote:
> > On Wed 02-09-20 14:06:12, Zi Yan wrote:
> > > From: Zi Yan <ziy@xxxxxxxxxx>
> > >
> > > Hi all,
> > >
> > > This patchset adds support for 1GB THP on x86_64. It is on top of
> > > v5.9-rc2-mmots-2020-08-25-21-13.
> > >
> > > 1GB THP is more flexible for reducing translation overhead and increasing the
> > > performance of applications with large memory footprint without application
> > > changes compared to hugetlb.
> >
> > Please be more specific about usecases. This better have some strong
> > ones because THP code is complex enough already to add on top solely
> > based on a generic TLB pressure easing.
>
> Hello, Michal!
>
> We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
> performance wins on some workloads.

Let me clarify. I am not questioning 1GB (or large) pages in general. I
believe it is quite clear that there are usecases which hugely benefit
from them. I am mostly asking for the transparent part of it which
traditionally means that userspace mostly doesn't have to care and get
them. 2MB THPs have established certain expectations mostly a really
aggressive pro-active instanciation. This has bitten us many times and
create a "you need to disable THP to fix your problem whatever that is"
cargo cult. I hope we do not want to repeat that mistake here again.

> Historically we allocated gigantic pages at the boot time, but recently moved
> to cma-based dynamic approach. Still, hugetlbfs interface requires more management
> than we would like to do. 1 GB THP seems to be a better alternative. So I definitely
> see it as a very useful feature.
>
> Given the cost of an allocation, I'm slightly skeptical about an automatic
> heuristics-based approach, but if an application can explicitly mark target areas
> with madvise(), I don't see why it wouldn't work.

An explicit opt-in sounds much more appropriate to me as well. If we go
with a specific API then I would not make it 1GB pages specific. Why
cannot we have an explicit interface to "defragment" address space
range into large pages and the kernel would use large pages where
appropriate? Or is the additional copying prohibitively expensive?
--
Michal Hocko
SUSE Labs