Re: [RFC v2] Support volatile range for anon vma

From: Minchan Kim
Date: Wed Oct 31 2012 - 21:27:15 EST


On Wed, Oct 31, 2012 at 06:22:58PM -0700, Paul Turner wrote:
> On Wed, Oct 31, 2012 at 5:50 PM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
> > Hello,
> >
> > On Wed, Oct 31, 2012 at 02:59:07PM -0700, Paul Turner wrote:
> >> On Wed, Oct 31, 2012 at 2:35 PM, Andrew Morton
> >> <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> >> >
> >> > On Tue, 30 Oct 2012 10:29:54 +0900
> >> > Minchan Kim <minchan@xxxxxxxxxx> wrote:
> >> >
> >> > > This patch introduces new madvise behaviors, MADV_VOLATILE and
> >> > > MADV_NOVOLATILE, for anonymous pages. It differs from
> >> > > John Stultz's version, which considers only tmpfs, while this patch
> >> > > considers only anonymous pages, so it cannot cover John's use case.
> >> > > If the idea below proves reasonable, I hope we can unify both
> >> > > concepts via madvise/fadvise.
> >> > >
> >> > > The rationale is as follows.
> >> > > Many allocators call munmap(2) when the user calls free(3) if ptr is
> >> > > in an mmapped area. But munmap isn't cheap because it has to clean up
> >> > > all pte entries and unlink a vma, so the overhead increases
> >> > > linearly with the mmapped area's size.
> >> >
> >> > Presumably the userspace allocator will internally manage memory in
> >> > large chunks, so the munmap() call frequency will be much lower than
> >> > the free() call frequency. So the performance gains from this change
> >> > might be very small.
> >>
> >> I don't think I strictly understand the motivation from a
> >> malloc-standpoint here.
> >>
> >> These days we (tcmalloc) use madvise(..., MADV_DONTNEED) when we want
> >> to perform discards on Linux. For any reasonable allocator (short
> >> of binding malloc --> mmap, free --> unmap) this seems a better
> >> choice.
> >>
> >> Note also from a performance stand-point I doubt any allocator (which
> >> cares about performance) is going to want to pay the cost of even a
> >> null syscall around typical malloc/free usage (consider: a tcmalloc
> >
> > Good point.
> >
> >> malloc/free pair is currently <20ns). Given then that this cost is
> >> amortized once you start doing discards on larger blocks MADV_DONTNEED
> >> seems a preferable interface:
> >> - You don't need to reconstruct an arena when you do want to allocate
> >> since there's no munmap/mmap for the region to change about
> >> - There are no syscalls involved in later reallocating the block.
> >
> > The above benefits apply to MADV_VOLATILE, too.
> > But as you pointed out, there is a little more overhead than DONTNEED
> > because the allocator must call madvise(MADV_NOVOLATILE) before allocation.
> > Although madvise(NOVOLATILE) just marks a vma flag, it does need mmap_sem,
> > which could be a problem on parallel malloc/free workloads, as KOSAKI pointed out.
> >
> > In such a case, we can change the semantics so malloc doesn't need to call
> > madvise(NOVOLATILE) before allocating. Then the page fault handler has to
> > check whether the page fault was caused by access to a volatile vma. If so,
> > it could return a zero page instead of SIGBUS and mark the vma as no longer
> > volatile.
>
> I think being able to determine whether the backing was discarded
> (about an atomic transition to non-volatile) would be a required
> property to make this useful for non-malloc use-cases.
>

Absolutely.
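
For reference, the MADV_DONTNEED discard pattern discussed above can be
sketched roughly as below. This is only an illustration, not code from the
patch or from tcmalloc: arena_create() and arena_discard() are hypothetical
helper names, and the sketch assumes a private anonymous mapping on Linux,
where discarded pages read back as zero-filled on the next access.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: reserve a private anonymous region of len bytes.
 * Returns NULL on failure. */
static unsigned char *arena_create(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: drop the pages backing [addr, addr + len) while
 * keeping the mapping itself, so no munmap/mmap is needed to reuse it.
 * addr must be page-aligned. Returns 0 on success, -1 on error. */
static int arena_discard(unsigned char *addr, size_t len)
{
    return madvise(addr, len, MADV_DONTNEED);
}
```

After arena_discard() the range stays mapped, so the allocator can hand the
block out again and simply touch it; the kernel faults in fresh zero pages
without any further syscall. The old contents are lost unconditionally, which
is the difference from the volatile semantics proposed in this thread.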

--
Kind regards,
Minchan Kim