Re: [PATCH 21/46] hugetlb: use struct hugetlb_pte for walk_hugetlb_range

From: James Houghton
Date: Thu Jan 19 2023 - 18:00:03 EST


On Thu, Jan 19, 2023 at 12:53 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Thu, Jan 19, 2023 at 11:42:26AM -0800, James Houghton wrote:
> > - We avoid problems related to compound PTEs (the problem being: two
> > threads racing to populate a contiguous and non-contiguous PTE that
> > take up the same space could lead to user-detectable incorrect
> > behavior. This isn't hard to fix; it will be when I send the arm64
> > patches up.)
>
> Could you elaborate this one a bit more?

In hugetlb_mcopy_atomic_pte(), we check that the PTE we're about to
overwrite is pte_none() before overwriting it. For contiguous PTEs,
this only checks the first PTE in the bunch.

If someone came around and populated one of the PTEs that lay in the
middle of a potentially contiguous group of PTEs, we could end up
overwriting that PTE if we later UFFDIO_CONTINUEd in such a way as to
create a contiguous PTE.

We would expect to get EEXIST here, but in this case the operation
would succeed. To fix this, we can just check that ALL the PTEs in the
contiguous bunch have the value that we're expecting, not just the
first one.
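
Roughly, the check would become something like the following (untested
sketch; hugetlb_contig_range_none() is a made-up helper name, and the
actual arm64 patches may structure this differently):

/*
 * Untested sketch: verify that every PTE in a would-be contiguous
 * group is still pte_none(), not just the first one. 'ptep' points at
 * the first entry of the group and 'ncontig' is the number of entries
 * the contiguous mapping would cover.
 */
static bool hugetlb_contig_range_none(pte_t *ptep, unsigned int ncontig)
{
        unsigned int i;

        for (i = 0; i < ncontig; i++)
                if (!pte_none(huge_ptep_get(ptep + i)))
                        return false;
        return true;
}

hugetlb_mcopy_atomic_pte() would then bail out with -EEXIST when this
check fails, instead of only testing the first entry.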

hugetlb_no_page() has the same problem, but it's not immediately clear
to me how it would result in incorrect behavior.

>
> > This might seem kind of contrived, but let's say you have a VM with 1T
> > of memory, and you find 100 memory errors all in different 1G pages
> > over the life of this VM (years, potentially). Having 10% of your
> > memory be 4K-mapped is definitely worse than having 10% be 2M-mapped
> > (lost performance and increased memory overhead). There might be other
> > cases in the future where being able to have intermediate mapping
> > sizes could be helpful.
>
> This is not the norm, or is it? How the possibility of bad pages can
> distribute over hosts over years? This can definitely affect how we should
> target the intermediate level mappings.

I can't really speak for norms generally, but I can try to speak for
Google Cloud. Google Cloud hasn't had memory error virtualization for
very long (only about a year), but we've seen cases where VMs can pick
up several memory errors over a few days/weeks. IMO, 100 errors in
separate 1G pages over a few years isn't completely nonsensical,
especially if the memory that you're using isn't so reliable or was
damaged in shipping (like if it was flown over the poles or
something!).

Now there is the concern about how an application would handle it. In
a VMM's case, we can virtualize the error for the guest. In the guest,
it's possible that a good chunk of the errors lie in unused pages and
so can easily be marked as poisoned; it's also possible that recovery
is much more difficult. Still, it's not unreasonable for an
application to recover from a lot of memory errors.

- James