Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

From: Peter Xu
Date: Thu Feb 09 2023 - 14:11:49 EST


On Thu, Feb 09, 2023 at 08:43:45AM -0800, James Houghton wrote:
> On Wed, Feb 8, 2023 at 8:16 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > On Tue, Feb 07, 2023 at 04:26:02PM -0800, James Houghton wrote:
> > > On Tue, Feb 7, 2023 at 3:13 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > > >
> > > > James,
> > > >
> > > > On Tue, Feb 07, 2023 at 02:46:04PM -0800, James Houghton wrote:
> > > > > > Here is the result: [1] (sorry it took a little while heh). The
> > > >
> > > > Thanks. From what I can tell, that number shows that it'll be great we
> > > > start with your rfcv1 mapcount approach, which mimics what's proposed by
> > > > Matthew for generic folio.
> > >
> > > Do you think the RFC v1 way is better than doing the THP-like way
> > > *with the additional MMU notifier*?
> >
> > What's the additional MMU notifier you're referring?
>
> An MMU notifier that informs KVM that a collapse has happened without
> having to invalidate_range_start() and invalidate_range_end(), the one
> you're replying to lower down in the email. :) [ see below... ]

Isn't that something that is needed no matter what mapcount approach we'll
go for? Did I miss something?

>
> >
> > >
> > > >
> > > > > > implementation of the "RFC v1" way is pretty horrible[2] (and this
> > > >
> > > > Any more information on why it's horrible? :)
> > >
> > > I figured the code would speak for itself, heh. It's quite complicated.
> > >
> > > I really didn't like:
> > > 1. The 'inc' business in copy_hugetlb_page_range.
> > > 2. How/where I call put_page()/folio_put() to keep the refcount and
> > > mapcount synced up.
> > > 3. Having to check the page cache in UFFDIO_CONTINUE.
> >
> > I think the complexity is one thing which I'm fine with so far. However
> > when I think again about the things behind that complexity, I noticed there
> > may be at least one flaw that may not be trivial to work around.
> >
> > It's about truncation. The problem is now we use the pgtable entry to
> > represent the mapcount, but the pgtable entry cannot be zapped easily,
> > unless vma unmapped or collapsed.
> >
> > It means e.g. truncate_inode_folio() may stop working for hugetlb (of
> > course, with page lock held). The mappings will be removed for real, but
> > not the mapcount for HGM anymore, because unmap_mapping_folio() only zaps
> > the pgtable leaves, not the ones that we used to account for mapcounts.
> >
> > So the kernel may see weird things, like mapcount>0 after
> > truncate_inode_folio() being finished completely.
> >
> > For HGM to do the right thing, we may want to also remove the non-leaf
> > entries when truncating or doing similar things like a rmap walk to drop
> > any mappings for a page/folio. Though that's not doable for now because
> > the locks that truncate_inode_folio() is weaker than what we need to free
> > the pgtable non-leaf entries - we'll need mmap write lock for that, the
> > same as when we unmap or collapse.
> >
> > Matthew's design doesn't have such issue if the ptes need to be populated,
> > because mapcount is still with the leaves; not the case for us here.
> >
> > If that's the case, _maybe_ we still need to start with the stupid but
> > working approach of subpage mapcounts.
>
> Good point. I can't immediately think of a solution. I would prefer to
> go with the subpage mapcount approach to simplify HGM for now;
> optimizing mapcount for HugeTLB can then be handled separately. If
> you're ok with this, I'll go ahead and send v2.

I'm okay with it, but I suggest wait for at least another one day or two to
see whether Mike or others have any comments.

>
> One way that might be possible: using the PAGE_SPECIAL bit on the
> hstate-level PTE to indicate if mapcount has been incremented or not
> (if the PTE is pointing to page tables). As far as I can tell,
> PAGE_SPECIAL doesn't carry any meaning for HugeTLB PTEs, but we would
> need to be careful with existing PTE examination code as to not
> misinterpret these PTEs.

This is an interesting idea. :) Yes I don't see it being used at all in any
pgtable non-leaves.

Then it's about how to let the zap code know when to remove the special
bit, hence the mapcount, because not all of them should.

Maybe it can be passed over as a new zap_flags_t bit?

Thanks,

--
Peter Xu