On Tue, Jun 03, 2025 at 03:50:54PM +0200, Oscar Salvador wrote:
> On Mon, Jun 02, 2025 at 05:30:19PM -0400, Peter Xu wrote:
> > Right, and thanks for the git digging as usual. I would agree hugetlb
> > is more challenging than many other modules for git archaeology. :)
> >
> > Even if I mentioned the invalidate_lock, I don't think I thought
> > deeper than that. I just wished that, whenever possible, we still
> > move hugetlb code closer to generic code, so if that's the goal we
> > may still want to have a closer look one day at whether hugetlb can
> > also use the invalidate_lock. Maybe it isn't worthwhile in the end:
> > invalidate_lock is currently a rwsem, which normally at least allows
> > concurrent faults, but that's currently what isn't allowed in hugetlb
> > anyway..
> >
> > If we start to remove finer-grained locks, that work will be even
> > harder, and removing the folio lock in this case in the fault path
> > also moves hugetlbfs even further away from other file systems. That
> > might be slightly against what we used to wish for, which is to make
> > it closer to the others. Meanwhile, I'm also not yet sure of the
> > benefit of not taking the folio lock at all; e.g., I don't expect
> > perf would change at all even if the lock is avoided. We may want to
> > think about that too when doing so.
>
> Ok, I have to confess I was not looking at things from this
> perspective, but when doing so, yes, you are right: we should strive
> to find replacements wherever we can so as not to rely on
> hugetlb-specific code.
>
> I do not know about this case though; I am not sure what other options
> we have for shutting out concurrent faults while doing some other
> operation. But it is something we should definitely look at.
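
For the record, the generic mechanism I had in mind is the
mapping->invalidate_lock pattern documented in
Documentation/filesystems/locking.rst. A rough, untested sketch of the
two sides, just to illustrate (helpers as in current mainline):

	/* truncate / hole punch side: exclusive, blocks page cache fills */
	filemap_invalidate_lock(inode->i_mapping);
	truncate_inode_pages_range(inode->i_mapping, start, end);
	filemap_invalidate_unlock(inode->i_mapping);

	/* fault side: shared, so faults can still run concurrently */
	filemap_invalidate_lock_shared(inode->i_mapping);
	/* ... look up or instantiate the folio in the page cache ... */
	filemap_invalidate_unlock_shared(inode->i_mapping);

Whether hugetlb's fault-vs-truncate serialization can be mapped onto
this is exactly the part that would need a closer look.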

> Wrt the lock: there were two locks, the old_folio one (taken in
> hugetlb_fault) and the pagecache_folio one.
There're actually three places this patch touched; the 3rd one is
hugetlb_no_page(), in which case I also think we should lock it, not
only because file folios normally do it (see do_fault(), for example),
but also because that's exactly the possible race James mentioned, I
believe, of a !uptodate hugetlb folio being injected by UFFDIO_CONTINUE,
along the lines of:
folio = alloc_hugetlb_folio(vma, vmf->address, false);
...
/* the !uptodate window: nothing must be able to map the folio here */
folio_zero_user(folio, vmf->real_address);
__folio_mark_uptodate(folio);
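
To spell out the other side of that race: UFFDIO_CONTINUE ends up in
hugetlb_mfill_atomic_pte(), which (hand-simplified here, so not
verbatim) takes the folio straight from the page cache:

	/* simplified from the MFILL_ATOMIC_CONTINUE case */
	folio = filemap_lock_hugetlb_folio(h, mapping, idx);
	if (IS_ERR(folio))
		return -EFAULT;
	/*
	 * filemap_lock_hugetlb_folio() takes the folio lock, so if
	 * hugetlb_no_page() holds the lock across the zero + uptodate
	 * window above, CONTINUE can never map a !uptodate folio.
	 */

So taking the folio lock in hugetlb_no_page() would close that window.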

> The thing was not a worry about how much perf we leave on the table
> because of these locks, as I am pretty sure it is next to 0; my drive
> was to understand what they protect and why, because, as the
> discussion showed, none of us really had a good idea about it, and it
> turns out that this goes back more than ~20 years.
>
> Another topic for the lock (old_folio, so the one we copy from): when
> we compare it to generic code, we do not take the lock there. Looking
> at do_wp_page(), we do __get__ a reference on the folio we copy from,
> but not the lock, so AFAIU the lock seems only to please
Yes, this is a good point; for the CoW path alone, maybe we don't need
to lock old_folio.
> folio_move_anon_rmap() from hugetlb_wp.
>
> Taking a look at do_wp_page()->wp_can_reuse_anon_folio(), which also
> calls folio_move_anon_rmap() in case we can re-use the folio, it only
> takes the lock before the call to folio_move_anon_rmap(), and then
> unlocks it.
IMHO, do_wp_page() took the folio lock not for folio_move_anon_rmap(),
but for checking the swapcache/ksm bits, which need to be serialized
with the folio lock.
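
From memory (so possibly not verbatim), the tail of
wp_can_reuse_anon_folio() reads along these lines:

	if (!folio_trylock(folio))
		return false;
	if (folio_test_swapcache(folio))
		folio_free_swap(folio);
	if (folio_test_ksm(folio) || folio_ref_count(folio) != 1) {
		folio_unlock(folio);
		return false;
	}
	/* sole reference + lock held: safe to flag the folio for reuse */
	folio_move_anon_rmap(folio, vma);
	folio_unlock(folio);
	return true;

So the lock scope there is really about making the swapcache, ksm and
refcount checks stable; folio_move_anon_rmap() just happens to sit
inside it.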
So I'm not 100% confident on the folio_move_anon_rmap() part, but I
_think_ it deserves a data_race(), and IIUC it works not because of the
folio lock, but because of how the anon_vma is managed as a tree as of
now, so that as long as the store is a WRITE_ONCE() even a race is
benign (because the rmap walker will either see a complete old anon_vma
that includes the parent process's anon_vma, or the child's). What
really protects the anon_vma should be the anon_vma lock.. That can
definitely be a separate topic. I'm not sure whether you'd like to dig
this part out, but if you do I'd also be more than happy to know
whether my understanding needs correction here.. :)
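
For reference, the whole of folio_move_anon_rmap() today boils down to
a single WRITE_ONCE() of folio->mapping (quoting from memory, so
possibly not verbatim):

	void folio_move_anon_rmap(struct folio *folio,
				  struct vm_area_struct *vma)
	{
		void *anon_vma = vma->anon_vma;

		VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
		VM_BUG_ON_VMA(!anon_vma, vma);

		anon_vma += PAGE_MAPPING_ANON;
		/* flip the pointer and the anon bit in one atomic store */
		WRITE_ONCE(folio->mapping, anon_vma);
	}

Note that it does assert folio_test_locked() today, so dropping the
folio lock around the call would also mean revisiting that assertion.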

In general, I still agree with you that if the hugetlb CoW path can be
made to look closer to do_wp_page(), that's great.