Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

From: Mike Kravetz
Date: Thu Jan 19 2023 - 12:32:45 EST


On 01/19/23 08:57, James Houghton wrote:
> > > > > I wonder if the following crazy idea has already been discussed: treat the
> > > > > whole mapping as a single large logical mapping. One reference and one
> > > > > mapping, no matter how the individual parts are mapped into the assigned
> > > > > page table sub-tree.
> > > > >
> > > > > Because for hugetlb with MAP_SHARED, we know that the complete assigned
> > > > > sub-tree of page tables can only map the given hugetlb page, no fragments of
> > > > > something else. That's very different to THP in private mappings ...
> > > > >
> > > > > So as soon as the first piece gets mapped, we increment refcount+mapcount.
> > > > > Other pieces in the same subtree don't do that.
> > > > >
> > > > > Once the last piece is unmapped (or simpler: once the complete subtree of
> > > > > page tables is gone), we decrement refcount+mapcount. Might require some
> > > > > brain power to do this tracking, but I wouldn't call it impossible right
> > > > > from the start.
> > > > >
> > > > > Would such a design violate other design aspects that are important?
> > >
> > > This is actually how mapcount was treated in HGM RFC v1 (though not
> > > refcount); it is doable for both [2].
> >
> > My apologies for being late to the party :)
> >
> > When Peter first brought up the issue with ref/map_count overflows I was
> > thinking that we should use a scheme like David describes above. As
> > James points out, this was the approach taken in the first RFC.
> >
> > > One caveat here: if a page is unmapped in small pieces, it is
> > > difficult to know if the page is legitimately completely unmapped (we
> > > would have to check all the PTEs in the page table).
> >
> > Are we allowing unmapping of small (non-huge page sized) areas with HGM?
> > We must be if you are concerned with it. What API would cause this?
> > I just do not remember this discussion.
>
> There was some discussion about allowing MADV_DONTNEED on
> less-than-hugepage pieces [3] (it actually motivated the switch from
> UFFD_FEATURE_MINOR_HUGETLBFS_HGM to MADV_SPLIT). It isn't implemented
> in this series, but it could be implemented in the future.

OK, so we do not actually create HGM mappings until a uffd operation is
done at less than huge page granularity. MADV_SPLIT just says that HGM
mappings are 'possible' for this vma. Hopefully, my understanding is
correct.

I was concerned about things like the page fault path, but in that case
we have already 'entered HGM mode' via a uffd operation.

Both David and Peter have asked whether eliminating intermediate mapping
levels would be a simplification. I trust your response that it would
not help much in the current design/implementation. But, it did get me
thinking about something else.

Perhaps we have discussed this before, and perhaps it does not meet all
user needs, but one possible way to simplify this (rough code sketch
after the list) would be:

- 'Enable HGM' via MADV_SPLIT. Must be done at huge page (hstate)
granularity.
- MADV_SPLIT implicitly unmaps everything within the range.
- MADV_SPLIT says all mappings for this vma will now be done at a base
(4K) page size granularity. vma would be marked some way.
- I think this eliminates the need for hugetlb_pte's as we KNOW the
mapping size.
- We still use huge pages to back 4K mappings, and we still have to deal
with the ref/map_count issues.
- Code touching hugetlb page tables would KNOW the mapping size up front.
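
Roughly, I am imagining something like the below. To be clear, this is
just a sketch: VM_HGM_BASE and the helper name are made up for
illustration and are not part of this series.

#include <linux/hugetlb.h>

/*
 * Sketch only: VM_HGM_BASE would be a hypothetical vma flag set by
 * MADV_SPLIT in this scheme.  Any code walking hugetlb page tables
 * could then ask for the (single) mapping size up front instead of
 * carrying a hugetlb_pte around.
 */
static unsigned long hugetlb_vma_mapping_size(struct vm_area_struct *vma)
{
	if (vma->vm_flags & VM_HGM_BASE)
		return PAGE_SIZE;	/* everything mapped at 4K */
	return huge_page_size(hstate_vma(vma));
}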

Again, apologies if we already discussed and dismissed this type of
approach.

> > When I was thinking about this I was a bit concerned about having enough
> > information to know exactly when to inc or dec counts. I was actually
> > worried about knowing when to do the increment. I don't recall how it was
> > done in the first RFC, but from a high level it would need to be done
> > when the first hstate level PTE is allocated/added to the page table.
> > Right? My concern was with all the places where we could 'error out'
> > after allocating the PTE, but before initializing it. I was just thinking
> > that we might need to scan the page table or keep metadata for better
> > or easier accounting.
>
> The only two places where we can *create* a high-granularity page
> table are: __mcopy_atomic_hugetlb (UFFDIO_CONTINUE) and
> copy_hugetlb_page_range. RFC v1 did not properly deal with the cases
> where we error out. To correctly handle these cases, we basically have
> to do the pagecache lookup before touching the page table.
>
> 1. For __mcopy_atomic_hugetlb, we can lookup the page before doing the
> PT walk/alloc. If PT walk tells us to inc the page ref/mapcount, we do
> so immediately. We can easily pass the page into
> hugetlb_mcopy_atomic_pte() (via 'pagep').
>
> 2. For copy_hugetlb_page_range() for VM_MAYSHARE, we can also do the
> lookup before we do the page table walk. I'm not sure how to support
> non-shared HGM mappings with this scheme (in this series, we also
> don't support non-shared; we return -EINVAL).
> NB: The only case where high-granularity mappings for !VM_MAYSHARE
> VMAs would come up is as a result of hwpoison.
>
> So we can avoid keeping additional metadata for what this series is
> trying to accomplish, but if the above isn't acceptable, then I/we can
> try to come up with a scheme that would be acceptable.

OK, I was thinking we had to deal with other code paths such as page
fault, but now I understand that is not the case with this design.
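
To make sure I understand the ordering for the UFFDIO_CONTINUE path,
something like the below? This is only a sketch of the ordering;
hgm_walk_and_map() is a made up name for whatever does the walk/alloc
and the conditional ref/mapcount update:

#include <linux/pagemap.h>

int hgm_walk_and_map(struct page *page);	/* hypothetical */

static int hgm_continue_sketch(struct address_space *mapping, pgoff_t idx)
{
	struct page *page;

	/* 1) pagecache lookup before touching the page table */
	page = find_lock_page(mapping, idx);
	if (!page)
		return -EFAULT;

	/*
	 * 2) walk/alloc the page table; if this is the first piece of
	 *    the huge page being mapped, take ref+mapcount immediately
	 *    so an error below leaves nothing half-counted.
	 */
	if (hgm_walk_and_map(page)) {
		unlock_page(page);
		put_page(page);
		return -ENOMEM;
	}

	unlock_page(page);
	return 0;
}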

> There is also the possibility that the scheme implemented in this
> version of the series is acceptable (i.e., the page_mapcount() API
> difference, which results in slightly modified page migration behavior
> and smaps output, is ok... assuming we have the refcount overflow
> check).
>
> >
> > I think Peter mentioned it elsewhere, we should come up with a workable
> > scheme for HGM ref/map counting. This can be done somewhat independently.
>
> FWIW, what makes the most sense to me right now is to implement the
> THP-like scheme and mark HGM as mutually exclusive with the vmemmap
> optimization. We can later come up with a scheme that lets us retain
> compatibility. (Is that what you mean by "this can be done somewhat
> independently", Mike?)

Sort of, I was only saying that getting the ref/map counting right seems
like a task that can be worked on independently. Using the THP-like
scheme is good.
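
Just to make sure we mean the same thing by THP-like: each piece that
gets mapped takes a reference on the head page and a mapcount on the
subpage, roughly as below. Sketch only; real code would go through the
rmap helpers rather than touching _mapcount directly:

#include <linux/mm.h>

/* map one base page sized piece of the hugetlb page (sketch) */
static void hgm_map_piece(struct page *hpage, struct page *subpage)
{
	get_page(hpage);			/* head page reference */
	atomic_inc(&subpage->_mapcount);	/* per-subpage mapcount */
}

/* and the inverse on unmap */
static void hgm_unmap_piece(struct page *hpage, struct page *subpage)
{
	atomic_dec(&subpage->_mapcount);
	put_page(hpage);
}

The per-subpage struct pages holding that _mapcount are exactly what
the vmemmap optimization frees, hence the mutual exclusion for now.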
--
Mike Kravetz