Re: [PATCH -alternative] mm: hugetlbfs: Close race during teardownof hugetlbfs shared page tables V2 (resend)

From: Mel Gorman
Date: Mon Jul 23 2012 - 07:40:20 EST

On Sun, Jul 22, 2012 at 09:04:33PM -0700, Hugh Dickins wrote:
> On Fri, 20 Jul 2012, Mel Gorman wrote:
> > On Fri, Jul 20, 2012 at 04:36:35PM +0200, Michal Hocko wrote:
> > > And here is my attempt for the fix (Hugh mentioned something similar
> > > earlier but he suggested using special flags in ptes or VMAs). I still
> > > owe doc. update and it hasn't been tested with too many configs and I
> > > could missed some definition updates.
> > > I also think that changelog could be much better, I will add (steal) the
> > > full bug description if people think that this way is worth going rather
> > > than the one suggested by Mel.
> > > To be honest I am not quite happy how I had to pollute generic mm code with
> > > something that is specific to a single architecture.
> > > Mel hammered it with the test case and it survived.
> >
> > Tested-by: Mel Gorman <mgorman@xxxxxxx>
> >
> > This approach looks more or less like what I was expecting. I like that
> > the trick was applied to the page table page instead of using PTE tricks
> > or by bodging it with a VMA flag like I was thinking so kudos for that. I
> > also prefer this approach to trying to free the page tables on or near
> > huge_pmd_unshare()
> >
> > In general I think this patch would execute better than mine because it is
> > far less heavy-handed but I share your concern that it changes the core MM
> > quite a bit for a corner case that only one architecture cares about. I am
> > completely biased of course, but I still prefer my patch because other than
> > an API change it keeps the bulk of the madness in arch/x86/mm/hugetlbpage.c
> > . I am also not concerned with the scalability of how quickly we can setup
> > page table sharing.
> >
> > Hugh, I'm afraid you get to choose :)
> Thank you bestowing that honour upon me :)

Just so you know, there was a ceremonial gong when it happened.

> Seriously, though, you
> were quite right to Cc me on this, it is one of those areas I ought
> to know something about (unlike hugetlb reservations, for example).
> Please don't be upset if I say that I don't like either of your patches.

I can live with that :) It would not be the first time we found the right
patch out of dislike for the first proposed and getting the fix is what's

> Mainly for obvious reasons - I don't like Mel's because anything with
> trylock retries and nested spinlocks worries me before I can even start
> to think about it;

That's a reasonable objection. The trylock could be avoided by always
falling through at the cost of reducing the amount of sharing
opportunities but the nested locking is unavoidable. I agree with you
that nested locking like this should always be a cause for concern.

> and I don't like Michal's for the same reason as Mel,
> that it spreads more change around in common paths than we would like.
> But I didn't spend much time thinking through either of them, they just
> seemed more complicated than should be needed. I cannot confirm or deny
> whether they're correct - though I still do not understand how mmap_sem
> can help you, Mel. I can see that it will help in your shmdt()ing test,
> but if you leave the area mapped on exit, then mmap_sem is not taken in
> the exit_mmap() path, so how does it help?

It certainly helps in the shmdt case which is what the test case focused
on because that is what the application that triggered this bug was
doing. However, you're right in that the exit_mmap() path is still vunerable
because it does not take mmap_sem. I'll think about that a bit more.

> I spent hours trying to dream up a better patch, trying various
> approaches. I think I have a nice one now, what do you think? And
> more importantly, does it work? I have not tried to test it at all,
> that I'm hoping to leave to you, I'm sure you'll attack it with gusto!
> If you like it, please take it over and add your comments and signoff
> and send it in.

I like it in that it's simple and I can confirm it works for the test case
of interest.

However, is your patch not vunerable to truncate issues?
madvise()/truncate() issues was the main reason why I was wary of VMA tricks
as a solution. As it turns out, madvise(DONTNEED) is not a problem as it is
ignored for hugetlbfs but I think truncate is still problematic. Lets say
we mmap(MAP_SHARED) a hugetlbfs file and then truncate for whatever reason.

__unmap_hugepage_range (removes VM_MAYSHARE)

The VMA still exists so the consequences for this would be varied but
minimally fault is going to be "interesting".

I think that potentially we could work around this but it may end up stomping
on the core MM and not necessarily be any better than Michal's patch.

> The second part won't come up in your testing, and could
> be made a separate patch if you prefer: it's a related point that struck
> me while I was playing with a different approach.

I'm fine with the second part.

Mel Gorman
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at