Re: [RFC PATCH 00/16] 1GB THP support on x86_64

From: Jason Gunthorpe
Date: Thu Sep 03 2020 - 12:40:52 EST


On Wed, Sep 02, 2020 at 04:29:46PM -0400, Zi Yan wrote:
> On 2 Sep 2020, at 15:57, Jason Gunthorpe wrote:
>
> > On Wed, Sep 02, 2020 at 03:05:39PM -0400, Zi Yan wrote:
> >> On 2 Sep 2020, at 14:48, Jason Gunthorpe wrote:
> >>
> >>> On Wed, Sep 02, 2020 at 02:45:37PM -0400, Zi Yan wrote:
> >>>
> >>>>> Surprised this doesn't touch mm/pagewalk.c ?
> >>>>
> >>>> 1GB PUD page support is present for DAX purposes, so the code is there
> >>>> in mm/pagewalk.c already. I only needed to supply ops->pud_entry when using
> >>>> the functions in mm/pagewalk.c. :)
> >>>
> >>> Yes, but doesn't this change what is possible under the mmap_sem
> >>> without the page table locks?
> >>>
> >>> i.e. I would expect something like pmd_trans_unstable() to be required
> >>> as well for lockless walkers. (and I don't think the pmd code is 100%
> >>> right either)
> >>>
> >>
> >> Right. I missed that. Thanks for pointing it out.
> >> The code like this, right?
> >
> > Technically all those *pud's are racy too; the design here with the
> > _unstable function call always seemed weird. I strongly suspect it
> > should mirror how get_user_pages_fast works for lockless walking.
>
> You mean READ_ONCE on the page table entry first, then use that value
> for the rest of the loop? I am not quite familiar with this racy-check
> part of the code and am happy to hear more about it.

There are two main issues with THPs and lockless walks:

- The *pXX value can change at any time, as THPs can be split at any
moment. However, once observed to be a pointer to a sub page table, the
value is fixed under the read side of the mmap_sem (I think - I never
did find the code path supporting this, but everything is busted if it
isn't true...)

- Reading the *pXX without load tearing is difficult on 32 bit arches

So if you do READ_ONCE() it takes care of the first problem.
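
Roughly, I'd expect a lockless walker to take a shape like this:
snapshot the entry once, then make every decision (and derive any lower
level table pointer) from that snapshot, never re-reading *pudp. This is
only a simplified sketch, not the actual gup_pud_range();
walk_huge_pud()/walk_pmd_range() are made-up helper names:

  pud_t *pudp = pud_offset(p4dp, addr);
  pud_t pud = READ_ONCE(*pudp);     /* one coherent snapshot of the entry */

  if (pud_none(pud))
          return 0;                 /* nothing mapped here */
  if (pud_huge(pud) || pud_devmap(pud))
          /* huge entry: decide from the snapshot value only */
          return walk_huge_pud(pud, pudp, addr, next);

  /* sub page table: derive the pmd table from the snapshot, not *pudp */
  return walk_pmd_range(pud, addr, next);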

However, if sizeof(*pXX) is 8 on a 32 bit platform then load tearing is
a problem. At least the various pXX_*() test functions operate on a
single 32 bit word, so they don't tear, but to convert the *pXX to a
lower level page table pointer a coherent, untorn read is required.
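
Concretely, the tearing worry is something like this (illustrative only,
assuming a 32 bit little endian config with 64 bit entries; this is not
code from the kernel):

  /* two separate 32 bit loads - not atomic as a pair */
  u32 lo = *(volatile u32 *)pmdp;
  u32 hi = *((volatile u32 *)pmdp + 1);

  /*
   * If the entry is concurrently changed (eg a THP split or collapse),
   * lo and hi can come from two different values, and a page table
   * pointer derived from the combined pfn points at garbage.
   */
  pmd_t pmd = __pmd(((u64)hi << 32) | lo);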

So, looking again, I remember now: I could never quite figure out why
gup_pmd_range() was safe to do:

pmd_t pmd = READ_ONCE(*pmdp);
[..]
} else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
[..]
ptem = ptep = pte_offset_map(&pmd, addr);

As I don't see what prevents load tearing of a 64 bit pmd here... e.g.
there is no pmd_trans_unstable() or equivalent.

But we do see gup_get_pte() using an anti-load-tearing technique...
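
From memory it is roughly the classic low/high retry read on the 32 bit
with 64 bit pte configs - paraphrased here, not the verbatim mm/gup.c
code:

  pte_t pte;

  do {
          pte.pte_low = ptep->pte_low;    /* low half first */
          smp_rmb();
          pte.pte_high = ptep->pte_high;  /* then the high half */
          smp_rmb();
  } while (unlikely(pte.pte_low != ptep->pte_low)); /* low half moved, retry */

  return pte;

That is exactly the kind of untorn read the pmd/pud levels seem to be
missing.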

Jason