Re: [PATCH v2 1/5] mm/hugetlb: fix races when looking up a CONT-PTE size hugetlb page

From: Mike Kravetz
Date: Wed Aug 24 2022 - 19:35:28 EST


On 08/24/22 17:41, Baolin Wang wrote:
>
>
> On 8/24/2022 3:31 PM, David Hildenbrand wrote:
> > > > > >
> > > > > > IMHO, these follow_huge_xxx() functions are arch-specified at first and
> > > > > > were moved into the common hugetlb.c by commit 9e5fc74c3025 ("mm:
> > > > > > hugetlb: Copy general hugetlb code from x86 to mm"), and now there are
> > > > > > still some arch-specified follow_huge_xxx() definition, for example:
> > > > > > ia64: follow_huge_addr
> > > > > > powerpc: follow_huge_pd
> > > > > > s390: follow_huge_pud
> > > > > >
> > > > > > What I mean is that follow_hugetlb_page() is a common and
> > > > > > not-arch-specified function, is it suitable to change it to be
> > > > > > arch-specified?
> > > > > > And thinking more, can we rename follow_hugetlb_page() as
> > > > > > hugetlb_page_faultin() and simplify it to only handle the page faults of
> > > > > > hugetlb like the faultin_page() for normal page? That means we can make
> > > > > > sure only follow_page_mask() can handle hugetlb.
> > > > > >
> > > >
> > > > Something like that might work, but you still have two page table walkers
> > > > for hugetlb. I like David's idea (if I understand it correctly) of
> > >
> > > What I mean is we may change the hugetlb handling like normal page:
> > > 1) use follow_page_mask() to look up a hugetlb firstly.
> > > 2) if can not get the hugetlb, then try to page fault by
> > > hugetlb_page_faultin().
> > > 3) if page fault successed, then retry to find hugetlb by
> > > follow_page_mask().
> >
> > That implies putting more hugetlbfs special code into generic GUP,
> > turning it even more complicated. But of course, it depends on how the
> > end result looks like. My gut feeling was that hugetlb is better handled
> > in follow_hugetlb_page() separately (just like we do with a lot of other
> > page table walkers).
>
> OK, fair enough.
>
> > >
> > > Just a rough thought, and I need more investigation for my idea and
> > > David's idea.
> > >
> > > > using follow_hugetlb_page for both cases. As noted, it will need to be
> > > > taught how to not trigger faults in the follow_page_mask case.
> > >
> > > Anyway, I also agree we need some cleanup, and firstly I think we should
> > > cleanup these arch-specified follow_huge_xxx() on some architectures
> > > which are similar with the common ones. I will look into these.
> >
> > There was a recent discussion on that, e.g.:
> >
> > https://lkml.kernel.org/r/20220818135717.609eef8a@thinkpad
>
> Thanks.
>
> >
> > >
> > > However, considering cleanup may need more investigation and
> > > refactoring, now I prefer to make these bug-fix patches of this patchset
> > > into mainline firstly, which are suitable to backport to old version to
> > > fix potential race issues. Mike and David, how do you think? Could you
> > > help to review these patches? Thanks.
> >
> > Patch #1 certainly add more special code just to handle another hugetlb
> > corner case (CONT pages), and maybe just making it all use
> > follow_hugetlb_page() would be even cleaner and less error prone.
> >
> > I agree that locking is shaky, but I'm not sure if we really want to
> > backport this to stable trees:
> >
> > https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> >
> > "It must fix a real bug that bothers people (not a, “This could be a
> > problem...” type thing)."
> >
> >
> > Do we actually have any instance of this being a real (and not a
> > theoretical) problem? If not, I'd rather clean it all up right away.
>
> I think this is a real problem (not theoretical), and easy to write some
> code to show the issue. For example, suppose thread A is trying to look up a
> CONT-PTE size hugetlb page under the lock, however antoher thread B can
> migrate the CONT-PTE hugetlb page at the same time, which will cause thread
> A to get an incorrect page, if thread A want to do something for this
> incorrect page, error occurs.

Is the primary concern the locking? If so, I am not sure we have an issue.
As mentioned in your commit message, current code will use
pte_offset_map_lock(). pte_offset_map_lock uses pte_lockptr, and pte_lockptr
will either be the mm wide lock or pmd_page lock. To me, it seems that
either would provide correct synchronization for CONT-PTE entries. Am I
missing something or misreading the code?

I started looking at code cleanup suggested by David. Here is a quick
patch (not tested and likely containing errors) to see if this is a step
in the right direction.

I like it because we get rid of/combine all those follow_huge_p*d
routines.