Re: THP-enabled filesystem vs. FALLOC_FL_PUNCH_HOLE

From: Kirill A. Shutemov
Date: Fri Mar 04 2016 - 17:48:32 EST


On Fri, Mar 04, 2016 at 11:38:47AM -0800, Hugh Dickins wrote:
> On Fri, 4 Mar 2016, Dave Hansen wrote:
> > On 03/04/2016 03:26 AM, Kirill A. Shutemov wrote:
> > > On Thu, Mar 03, 2016 at 07:51:50PM +0300, Kirill A. Shutemov wrote:
> > >> Truncate and punch hole that only cover part of a THP range are
> > >> implemented by zeroing out that part of the THP.
> > >>
> > >> This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE) behaviour.
> > >> As we don't really create a hole in this case, lseek(SEEK_HOLE) may have
> > >> inconsistent results depending on what pages happened to be allocated.
> > >> Not sure if it should be considered an ABI break or not.
> > >
> > > Looks like this shouldn't be a problem. man 2 fallocate:
> > >
> > > Within the specified range, partial filesystem blocks are zeroed,
> > > and whole filesystem blocks are removed from the file. After a
> > > successful call, subsequent reads from this range will return
> > > zeroes.
> > >
> > > It means we effectively have 2M filesystem block size.
> >
> > The question is still whether this will cause problems for apps.
> >
> > Isn't 2MB a quite unusual block size? Wouldn't some files on a tmpfs
> > filesystem act like they have a 2M blocksize and others like they have
> > 4k? Would that confuse apps?
>
> At risk of addressing the tip of an iceberg, before diving down to
> scope out the rest of the iceberg...
>
> So far as the behaviour of lseek(,,SEEK_HOLE) goes, I agree with Kirill:
> I don't think it matters to anyone if it skips some zeroed small pages
> within a hugepage. It may cause some artificial tests of holepunch and
> SEEK_HOLE to fail, and it ought to be documented as a limitation from
> choosing to enable THP (Kirill's way) on a filesystem, but I don't think
> it's an ABI break to worry about: anyone who cares just shouldn't enable.
>
> (Though in the case of my huge tmpfs, it's the reverse: the small hole
> punch splits the hugepage; but it's natural that Kirill's way would try
> to hold on to its compound pages for longer than I do, and that's fine
> so long as it's all consistent.)
>
> But I may disagree with "we effectively have 2M filesystem block size",
> beyond the SEEK_HOLE case. If we're emulating hugetlbfs in tmpfs, sure,
> we would have 2M filesystem block size. But if we're enabling THP
> (emphasis on T for Transparent) in tmpfs (or another filesystem), then
> when it matters it must act as if the block size is the 4k (or whatever)
> it usually is. When it matters? Approaching memcg limit or ENOSPC
> spring to mind.
>
> Ah, but suppose someone holepunches out most of each 2M page: they would
> expect the memcg not to be charged for those holes (just as when they
> munmap most of an anonymous THP) - that does suggest splitting is needed.

Hmm.. As split_huge_pages() can fail, we would need to propagate this
error to userspace. This potentially triggers some other user-visible
effect. EBUSY is not on the list of fallocate(2) error codes.

I think we can invent a way to track whether a THP has hole-punched
subpages, and prevent the compound page from being mapped with a PMD and
these subpages from being mapped at all.

But I'm reluctant to do it upfront, before real users emerge.

I would propose to wait and see what the user demands will be. Maybe we
are overthinking the situation.
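
For anyone who wants to look at the userspace-visible side of this, here
is a minimal sketch of a probe, not a definitive test. The helper name
punch_and_probe() and all of its parameters are made up for illustration.
It assumes Linux with glibc, and that the punched range lies entirely
inside the file. It fills a file with non-zero bytes, punches a range with
fallocate(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE), checks that the
range reads back as zeroes, and reports what lseek(SEEK_HOLE) returns from
offset 0.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>      /* open(), fallocate(), FALLOC_FL_* */
#include <string.h>
#include <sys/types.h>
#include <unistd.h>     /* pread(), pwrite(), lseek(), SEEK_HOLE */

/*
 * Hypothetical helper, not taken from any patch in this thread.
 *
 * Fills `path` with `file_len` bytes of 0xAA, punches the range
 * [hole_off, hole_off + hole_len) and reports:
 *   *zeroed     - 1 if every byte of the punched range reads back as 0
 *   *first_hole - the offset returned by lseek(fd, 0, SEEK_HOLE)
 *
 * Assumes the punched range lies inside the file.
 * Returns 0 on success, -1 if a system call fails or does short I/O.
 */
int punch_and_probe(const char *path, off_t file_len, off_t hole_off,
                    off_t hole_len, int *zeroed, off_t *first_hole)
{
    char buf[4096];
    int saved_errno;
    int ret = -1;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;

    /* Write non-zero data so that zeroing is observable afterwards. */
    memset(buf, 0xAA, sizeof(buf));
    for (off_t off = 0; off < file_len; ) {
        off_t left = file_len - off;
        size_t want = left < (off_t)sizeof(buf) ? (size_t)left : sizeof(buf);
        ssize_t n = pwrite(fd, buf, want, off);

        if (n <= 0)
            goto out;
        off += n;
    }

    /* PUNCH_HOLE has to be combined with KEEP_SIZE. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  hole_off, hole_len) != 0)
        goto out;

    /* fallocate(2): reads from the punched range must return zeroes. */
    *zeroed = 1;
    for (off_t off = hole_off; off < hole_off + hole_len; ) {
        off_t left = hole_off + hole_len - off;
        size_t want = left < (off_t)sizeof(buf) ? (size_t)left : sizeof(buf);
        ssize_t n = pread(fd, buf, want, off);

        if (n <= 0)
            goto out;
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] != 0)
                *zeroed = 0;
        off += n;
    }

    /* Where the first hole shows up is the filesystem-dependent part. */
    *first_hole = lseek(fd, 0, SEEK_HOLE);
    if (*first_hole != (off_t)-1)
        ret = 0;
out:
    saved_errno = errno;    /* do not let close() clobber errno */
    close(fd);
    errno = saved_errno;
    return ret;
}
```

With a page-sized, page-aligned punch on a filesystem that really frees
4k blocks, I would expect *first_hole to come back as hole_off. With the
zeroing approach on a huge page it could instead be the end of the file.
*zeroed should be 1 in both cases, which is the part the man page
actually promises.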

--
Kirill A. Shutemov