Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardownof hugetlbfs shared page tables V2 (resend)

From: Hugh Dickins
Date: Mon Jul 23 2012 - 21:09:12 EST

On Mon, 23 Jul 2012, Mel Gorman wrote:
> On Sun, Jul 22, 2012 at 09:04:33PM -0700, Hugh Dickins wrote:
> > On Fri, 20 Jul 2012, Mel Gorman wrote:
> > > On Fri, Jul 20, 2012 at 04:36:35PM +0200, Michal Hocko wrote:
> I like it in that it's simple and I can confirm it works for the test case
> of interest.

Phew, I'm glad to hear that, thanks.

> However, is your patch not vunerable to truncate issues?
> madvise()/truncate() issues was the main reason why I was wary of VMA tricks
> as a solution. As it turns out, madvise(DONTNEED) is not a problem as it is
> ignored for hugetlbfs but I think truncate is still problematic. Lets say
> we mmap(MAP_SHARED) a hugetlbfs file and then truncate for whatever reason.
> invalidate_inode_pages2
> invalidate_inode_pages2_range
> unmap_mapping_range_vma
> zap_page_range_single
> unmap_single_vma
> __unmap_hugepage_range (removes VM_MAYSHARE)
> The VMA still exists so the consequences for this would be varied but
> minimally fault is going to be "interesting".

You had me worried there, I hadn't considered truncation or invalidation2
at all.

But actually, I don't think they do pose any problem for my patch. They
would indeed if I were removing VM_MAYSHARE in __unmap_hugepage_range()
as you show above; but no, I'm removing it in unmap_hugepage_range().

That's only called by unmap_single_vma(): which is called via
unmap_vmas() by unmap_region() or exit_mmap() just before free_pgtables()
(the problem cases); or by madvise_dontneed() via zap_page_range(), which
as you note is disallowed on VM_HUGETLB; or by zap_page_range_single().

zap_page_range_single() is called by zap_vma_ptes(), which is only
allowed on VM_PFNMAP; or by unmap_mapping_range_vma(), which looked
like it was going to deadlock on i_mmap_mutex (with or without my
patch) until I realized that hugetlbfs has its own hugetlbfs_setattr()
and hugetlb_vmtruncate() which don't use unmap_mapping_range() at all.

invalidate_inode_pages2() (and _range()) do use unmap_mapping_range(),
but hugetlbfs doesn't support direct_IO, and otherwise I think they're
called by a filesystem directly on its own inodes, which hugetlbfs
does not. Anyway, if there's a deadlock on i_mmap_mutex somewhere
in there, it's not introduced by the proposed patch.

So, unmap_hugepage_range() is only being called in the problem cases,
just before free_pgtables(), when unmapping a vma (with mmap_sem held),
or when exiting (when we have the last reference to mm): in each case,
the vma is on its way out, and VM_MAYSHARE no longer of interest to others.

I spent a while concerned that I'd overlooked the truncation case, before
realizing that it's not a problem: the issue comes when we free_pgtables(),
which truncation makes no attempt to do.

So, after a bout of anxiety, I think my &= ~VM_MAYSHARE remains good.

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at