Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64

From: Matthew Wilcox
Date: Mon Oct 05 2020 - 15:38:12 EST


On Mon, Oct 05, 2020 at 03:12:55PM -0400, Zi Yan wrote:
> On 5 Oct 2020, at 11:55, Matthew Wilcox wrote:
> > One of the longer-term todo items is to support variable sized THPs for
> > anonymous memory, just like I've done for the pagecache. With that in
> > place, I think scaling up from PMD sized pages to PUD sized pages starts
> > to look more natural. Itanium and PA-RISC (two architectures that will
> > never be found in phones...) support 1MB, 4MB, 16MB, 64MB and upwards.
> > The RISC-V spec you pointed me at the other day confines itself to adding
> > support for 16, 64 & 256kB today, but does note that 8MB, 32MB and 128MB
> > sizes would be possible additions in the future.
>
> Just to understand the todo items clearly: with your pagecache patchset,
> the kernel should be able to understand variable sized THPs whether they
> are anonymous or not, right?

... yes ... modulo bugs and places I didn't fix because only anonymous
pages can get there ;-) There are still quite a few references to
HPAGE_PMD_MASK / SIZE / NR and I couldn't swear that they're all related
to things which are actually PMD sized. I did fix a couple of places
where the anonymous path assumed that pages were PMD sized because I
thought we'd probably want to do that sooner rather than later.

> For anonymous memory, we need kernel policies
> to decide what THP sizes to use at allocation, what to do when under
> memory pressure, and so on. In terms of implementation, the THP split
> function needs to support splitting from any order to any lower order.
> Am I missing anything here?

I think that's the bulk of the work. The swap code also needs work so we
don't have to split pages to swap them out.
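
The "any order to any lower order" split requirement quoted above comes
down to simple arithmetic on page orders. The following is a minimal
userspace sketch of that arithmetic only; the function names are
hypothetical and nothing here is taken from the kernel's split code.

```c
#include <assert.h>

/*
 * Number of order-new_order pieces produced by splitting one
 * order-old_order page. Only a split to a strictly lower order is valid.
 */
static inline unsigned long split_nr_pieces(unsigned int old_order,
					    unsigned int new_order)
{
	assert(new_order < old_order);
	return 1UL << (old_order - new_order);
}

/* Page frame number of the i-th piece, counting from the head page. */
static inline unsigned long split_piece_pfn(unsigned long head_pfn,
					    unsigned int old_order,
					    unsigned int new_order,
					    unsigned long i)
{
	assert(i < split_nr_pieces(old_order, new_order));
	return head_pfn + (i << new_order);
}
```

With 4kB base pages, splitting a PUD-sized page (order 18) into PMD-sized
pages (order 9) gives 512 pieces, and splitting a PMD-sized page into
order-4 pages gives 32.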

> > I think I'm leaning towards not merging this patchset yet. I'm in
> > agreement with the goals (allowing systems to use PUD-sized pages
> > automatically), but I think we need to improve the infrastructure to
> > make it work well automatically. Does that make sense?
>
> I agree that this patchset should not be merged in the current form.
> I think PUD THP support is a part of variable sized THP support, but
> the current form of the patchset does not have the “variable sized THP”
> spirit yet and is more like special-case support for PUD. I guess some
> changes to the existing THP code to make PUD THP less of a special case
> would make the whole patchset more acceptable?
>
> Can you elaborate more on the infrastructure part? Thanks.

Oh, this paragraph was just summarising the above. We need to
be consistently using thp_size() instead of HPAGE_PMD_SIZE, etc.
I haven't put much effort yet into supporting pages which are larger than
PMD-size -- that is, if a page is mapped with a PMD entry, we assume
it's PMD-sized. Once we can allocate a larger-than-PMD sized page, that
assumption no longer holds. I assume a lot of that is dealt with in your
patchset, although I haven't audited it to check for that.
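
To illustrate why a hard-coded PMD size goes wrong once larger pages
exist, here is a small userspace model. It is only a sketch: struct
model_page, the MODEL_* constants and the helper names are made up for
this example, and model_thp_size() merely mimics the idea of deriving the
size from the page's order. A 4kB base page is assumed.

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT	12	/* assumption: 4kB base pages */
#define MODEL_PAGE_SIZE		(1UL << MODEL_PAGE_SHIFT)
#define MODEL_HPAGE_PMD_ORDER	9
#define MODEL_HPAGE_PMD_SIZE	(MODEL_PAGE_SIZE << MODEL_HPAGE_PMD_ORDER)

static_assert(MODEL_HPAGE_PMD_SIZE == 2UL * 1024 * 1024,
	      "PMD-sized page is 2MB with 4kB base pages");

/* Hypothetical stand-in for a compound page: only its order matters here. */
struct model_page {
	unsigned int order;
};

/* Size derived from the page itself, whatever its order is. */
static inline unsigned long model_thp_size(const struct model_page *page)
{
	return MODEL_PAGE_SIZE << page->order;
}

/* Fragile: silently assumes every huge page is PMD-sized. */
static inline unsigned long bytes_fixed_pmd(const struct model_page *page)
{
	(void)page;
	return MODEL_HPAGE_PMD_SIZE;
}

/* Robust: asks the page how large it is. */
static inline unsigned long bytes_by_order(const struct model_page *page)
{
	return model_thp_size(page);
}
```

For an order-9 page both helpers agree on 2MB. For an order-18 page the
fixed-size helper still reports 2MB while the order-based one reports
1GB, which is the kind of mismatch an audit would be looking for.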

> > (*) It would be nice if hardware provided a way to track D/A on a sub-PTE
> > level when using PMD/PUD sized mappings. I don't know of any that does
> > that today.
>
> I agree it would be a nice hardware feature, but it also has a high cost.
> Each TLB entry would need 1024 bits to support this, which is about the
> size of 16 TLB entries, assuming each entry takes 8B of space. Then the
> question becomes: why not just have a bigger TLB? ;)

Oh, we don't have to track at the individual-page level for this to be
useful. Let's take the RISC-V Sv39 page table entry format as an example:

63-54 attributes
53-28 PPN2
27-19 PPN1
18-10 PPN0
9-8 RSW
7-0 DAGUXWRV

For a 2MB page, we currently insist that bits 18-10 are zero. If we
repurpose eight of those nine bits as A/D bits (four A/D pairs), we can
track at 512kB granularity. For 1GB pages, we can use 16 of the 18 bits
in 27-10 (eight A/D pairs) to track A/D at 128MB granularity. It's not
great, but it is quite cheap!
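
The bit arithmetic can be sketched in C as a software model. The exact
placement of the bits is an assumption made for this example (Accessed
bits starting at bit 10, Dirty bits directly above them); no RISC-V
specification defines such a layout, and all names below are
hypothetical. In the real proposal the hardware would set these bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical layout (an assumption, not part of any RISC-V spec):
 *   2MB leaf: A = bits 13-10, D = bits 17-14, bit 18 stays zero
 *   1GB leaf: A = bits 17-10, D = bits 25-18, bits 27-26 stay zero
 */
struct sub_ad_layout {
	unsigned int region_shift;	/* log2(sub-region size in bytes) */
	unsigned int nr_regions;	/* sub-regions per page, power of 2 */
};

#define SUB_AD_BASE_BIT	10u

static const struct sub_ad_layout sub_ad_2m = { 19, 4 };  /* 4 x 512kB */
static const struct sub_ad_layout sub_ad_1g = { 27, 8 };  /* 8 x 128MB */

/* Which sub-region does a byte offset into the huge page fall in? */
static inline unsigned int sub_ad_region(const struct sub_ad_layout *l,
					 uint64_t offset)
{
	return (unsigned int)(offset >> l->region_shift) &
	       (l->nr_regions - 1);
}

/* Return the PTE with the sub-region A bit (and D bit for writes) set. */
static inline uint64_t sub_ad_mark(uint64_t pte,
				   const struct sub_ad_layout *l,
				   uint64_t offset, int is_write)
{
	unsigned int r = sub_ad_region(l, offset);

	pte |= 1ULL << (SUB_AD_BASE_BIT + r);
	if (is_write)
		pte |= 1ULL << (SUB_AD_BASE_BIT + l->nr_regions + r);
	return pte;
}

/* Test whether a given sub-region has been written to. */
static inline int sub_ad_dirty(uint64_t pte, const struct sub_ad_layout *l,
			       unsigned int region)
{
	assert(region < l->nr_regions);
	return (pte >> (SUB_AD_BASE_BIT + l->nr_regions + region)) & 1;
}
```

Under this layout, a write at offset 1.5MB into a 2MB page lands in
sub-region 3 and sets bits 13 and 17, so only that 512kB quarter would
need writeback rather than the whole 2MB.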